Categorical Data

Categorical Data. To identify any association between two categorical data. Example: 1,073 subjects of both genders were recruited for a study where the

Categorical Data

• To identify any association between two categorical variables.

Example: 1,073 subjects of both genders were recruited for a study where the onset of severe chest pain is recorded for each subject.

Variables:

- Onset of severe chest pain (+ve / –ve)

- Gender (male / female)

Categorical Data Analysis

• Commonly denoted as $\chi^2$

• Useful in testing for independence between categorical variables (e.g. genetic association between cases / controls)

Compares the observed counts against the counts expected under the null hypothesis.

Assumptions

• Sufficiently large expected counts in each cell of the cross-tabulation table.

Chi-square tests

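$$\chi^2 = \sum_{i=1}^{K} \frac{\left(|O_i - E_i|\right)^2}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected counts in cell $i$, and $K$ is the number of cells.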

• In general, require:
(a) Smallest expected count is 1 or more
(b) At least 80% of the cells have an expected count of 5 or more

• Yates's continuity correction: provides a better approximation of the test statistic when the data are dichotomous (2 × 2 table)

$$\chi^2 = \sum_{i=1}^{K} \frac{\left(|O_i - E_i| - 0.5\right)^2}{E_i}$$

Small Cell Counts
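
The rule of thumb on expected counts can be checked mechanically once the expected counts are available. The following is a minimal sketch in Python; the function name `expected_counts_ok` and the example counts are illustrative assumptions and do not come from the slides.

```python
def expected_counts_ok(expected):
    """Check the small-cell rule of thumb for a chi-square test.

    expected: flat list of expected cell counts (hypothetical input).
    Returns True when (a) the smallest expected count is at least 1 and
    (b) at least 80% of the cells have an expected count of 5 or more.
    """
    smallest_ok = min(expected) >= 1
    share_large = sum(e >= 5 for e in expected) / len(expected)
    return smallest_ok and share_large >= 0.8


# Illustrative expected counts, made up for this sketch.
print(expected_counts_ok([12.0, 8.5, 6.1, 5.2, 3.4]))  # 4 of 5 cells are >= 5
print(expected_counts_ok([12.0, 8.5, 0.6, 5.2, 7.4]))  # smallest count is below 1
```

When the check fails, the usual options are to merge sparse categories or to use an exact test instead.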

• Null hypothesis of a hypothesized distribution for the data.

• Expected frequencies calculated under the hypothesized distribution.

For example: The number of outbreaks of flu epidemics is charted over the period 1500 to 1931, and the number of outbreaks each year is tabulated. The variable of interest counts the number of outbreaks occurring in each year of that 432-year period; for example, there were 223 years with no flu outbreaks.

Goodness-of-fit test

• Hypotheses:
H0: Data follow a Poisson distribution with mean 0.692
H1: Data do not follow a Poisson distribution with mean 0.692

Note: Mean 0.692 is obtained from the sample mean.

Expected frequency for X = 0
= 432 × P(X = 0), where X ~ Poisson(0.692)
≈ 216

Test statistic

$$\sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2,$$

with df = (6 – 1).

This yields a p-value of 0.99, so the data give no evidence against the null hypothesis of a Poisson distribution.

Goodness-of-fit test

Sample mean = (0 x 223 + 1 x 142 + 2 x 48 + 3 x 15 + 4 x 4 + 5 x 0) / 432 = 0.692
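
The sample mean and the expected frequencies under the hypothesised Poisson distribution can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the variable names are illustrative. The slides do not say how counts above 5 are handled, so no tail category is pooled here, and the sketch stops at the expected frequencies rather than attempting to reproduce the reported p-value.

```python
import math

# Observed number of years with 0, 1, 2, 3, 4, 5 outbreaks (from the slide).
observed = {0: 223, 1: 142, 2: 48, 3: 15, 4: 4, 5: 0}

n_years = sum(observed.values())
sample_mean = sum(k * count for k, count in observed.items()) / n_years


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Expected number of years for each outbreak count under Poisson(sample_mean).
expected = {k: n_years * poisson_pmf(k, sample_mean) for k in observed}

print(n_years, round(sample_mean, 3))
for k in observed:
    print(k, observed[k], round(expected[k], 1))
```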

Test of independence

Most common usage of the Pearson’s chi-square test.

H0: The two categorical variables are independent
H1: The two categorical variables are associated (i.e. not independent)

Under the independence assumption, if outcome A is independent of outcome B, then

P(A and B happen jointly) = P(A happens) × P(B happens)

Calculating expected frequencies

P(Chest pain +) = 83/1073        P(Chest pain -) = 990/1073
P(Males) = 520/1073              P(Females) = 553/1073

Under independence:
P(Males with chest pain +) = 83/1073 × 520/1073 = 0.0375

Expected(Males with chest pain +) = 1073 × P(Males with chest pain +)
= 1073 × 0.0375 = 40.224

Observed(Males with chest pain +) = 46
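
The same calculation extends to all four cells. The sketch below, with illustrative variable names, uses only the marginal totals quoted on the slide and applies the independence rule to each combination.

```python
n = 1073
pain_totals = {"+": 83, "-": 990}               # chest pain totals from the slide
gender_totals = {"male": 520, "female": 553}    # gender totals from the slide

expected = {}
for pain, pain_total in pain_totals.items():
    for gender, gender_total in gender_totals.items():
        # Independence: P(pain and gender) = P(pain) * P(gender)
        joint_prob = (pain_total / n) * (gender_total / n)
        expected[(pain, gender)] = n * joint_prob

for cell, value in expected.items():
    print(cell, round(value, 3))
```

The four expected counts necessarily add up to the sample size, which is a quick sanity check on the arithmetic.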

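• Expected frequencies calculated by:

$$E_{ij} = \frac{R_i \times C_j}{n}$$

where $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, and $n$ is the overall sample size.

• Degrees of freedom = (r – 1) (c – 1)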

Test of independence
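
As a rough illustration of the expected-frequency formula and the degrees of freedom, here is a small Python sketch for a general r × c table. The function name `chi_square_independence` is hypothetical. In the example table only the count of 46 and the marginal totals are stated on the slides; the remaining three cells are derived from them.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic for an r x c table of observed counts.

    table: list of rows, each a list of counts.
    Returns (statistic, degrees_of_freedom, expected_table).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # E_ij = R_i * C_j / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Rows: chest pain + / -; columns: male / female.
# 46 and the margins are from the slides; 37, 474 and 516 are derived.
observed = [[46, 37], [474, 516]]
stat, df, expected = chi_square_independence(observed)
print(round(stat, 3), df)
```

The statistic obtained this way should agree with the Pearson chi-square value that SPSS reports for the same table.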

Chi-square test

Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             1.744(b)   1    .187
Continuity Correction(a)       1.456      1    .228
Likelihood Ratio               1.745      1    .186
Fisher's Exact Test                                          .209         .114
Linear-by-Linear Association   1.743      1    .187
N of Valid Cases               1073

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 40.22.

Chi-square test

Looking at the validity of the assumption of sufficiently large sample sizes!
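
The Pearson and continuity-corrected values in the SPSS output can be checked by hand. The sketch below is one possible way to do so in Python; the function names are hypothetical, three of the four cell counts are derived from the margins and the observed count of 46, and clamping the corrected deviation at zero is a safeguard added here rather than something stated on the slides.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]].

    With yates=True, 0.5 is subtracted from each |O - E| before squaring
    (clamped at zero so that a tiny deviation cannot turn negative).
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    shift = 0.5 if yates else 0.0
    return sum(
        max(abs(o - e) - shift, 0.0) ** 2 / e
        for o, e in zip(observed, expected)
    )


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: chest pain + / -; columns: male / female.
a, b, c, d = 46, 37, 474, 516
plain = chi_square_2x2(a, b, c, d)
corrected = chi_square_2x2(a, b, c, d, yates=True)
print(round(plain, 3), round(p_value_df1(plain), 3))
print(round(corrected, 3), round(p_value_df1(corrected), 3))
```

The closed form for the p-value works only for 1 degree of freedom, where the chi-square variable is the square of a standard normal.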

• The $\chi^2$ test identifies whether there is a significant association between the two categorical variables.

• But does not quantify the strength and direction of the association.

• Need odds ratio to do this.

• Odds ratio defines “how many times more likely” it is to be in one category compared to the other:

Example: For the previous example on severe chest pain, males are about 1.4 times more likely to experience severe chest pains than females.

Quantification of the effect

Always know what is the outcome/event of interest, and what is the baseline reference! Otherwise OR can be interpreted both ways!

                 Pos. outcome   Neg. outcome
Exposure (+)     a              b
Exposure (-)     c              d

$$\mathrm{OR} = \frac{a/c}{b/d} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

$$\mathrm{RR} = \frac{a/(a+b)}{c/(c+d)} \neq \frac{a/(a+c)}{b/(b+d)}$$

Odds ratio and relative risk

Calculation of the odds ratio is pretty straightforward.
- Use the product of the leading diagonal divided by the product of the antidiagonal.

Relative risk is trickier though, since it’s not symmetric! While it’s commonly used interchangeably with OR, the interpretation and calculation are very different!
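
The symmetry of the odds ratio, and the lack of it for the relative risk, can be seen numerically. The sketch below uses hypothetical function names and the chest pain by gender counts; only 46 and the margins appear on the slides, and the other cells are derived from them.

```python
def odds_ratio(a, b, c, d):
    """OR = (a * d) / (b * c) for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)], risk in first row over second row."""
    return (a / (a + b)) / (c / (c + d))


# Rows: males, females (exposure); columns: chest pain +, chest pain -.
a, b, c, d = 46, 474, 37, 516
print(round(odds_ratio(a, b, c, d), 2), round(relative_risk(a, b, c, d), 2))

# Transposing the table (swapping b and c) leaves the OR unchanged
# but changes the RR.
print(round(odds_ratio(a, c, b, d), 2), round(relative_risk(a, c, b, d), 2))
```

The first line gives an odds ratio of roughly 1.35 for males relative to females, in line with the "about 1.4 times" quoted for the chest pain example.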

Case-Control Study
• Compare affected and unaffected individuals
• Usually retrospective in nature
• Temporal sequence cannot be established (timing for the onset of the disease)
• No information on population incidence of the disease

Cohort Study
• Usually random sampling of subjects within the population
• Prospective, retrospective or both
• Long follow-up; loss to follow-up
• Costly to conduct
• Temporal sequence can be established
• Provides information on population incidence of the disease

Exegesis on epidemiology

Case-control study: odds ratio is the right metric here!

Cohort study: relative risk is the appropriate metric here!

• Not straightforward to obtain confidence intervals of odds ratio (due to complexity in obtaining the variance)

• Straightforward to obtain the variance of the logarithm of odds ratio.

• Odds ratio is always reported together with the p-values (obtained from Pearson’s Chi-square test), and the corresponding confidence intervals.

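$$\mathrm{Var}[\log(\mathrm{OR})] = \mathrm{Var}\left[\log\frac{\hat{p}_1}{1-\hat{p}_1}\right] + \mathrm{Var}\left[\log\frac{\hat{p}_2}{1-\hat{p}_2}\right] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$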

Confidence interval of odds ratio

                 Ca (+ve)   Ca (-ve)
Smoking (+)      1,301      1,205
Smoking (-)      56         152

Odds and Odds Ratio

Odds Ratio (OR) = (1301/56)/(1205/152) = 2.93

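Pearson’s Chi-square = 47.985, on df = 1, p-value < 0.001

$$\mathrm{Var}[\log(\mathrm{OR})] = \frac{1}{1301} + \frac{1}{1205} + \frac{1}{56} + \frac{1}{152} = 0.026$$

$$95\%\ \text{confidence interval} = \exp\left(\log(2.93) \pm 1.96\sqrt{0.026}\right) = (2.14,\ 4.02)$$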

Case study on smoking and lung cancer
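
The figures in this case study can be reproduced with a short script. This is a minimal sketch with illustrative variable names, using the four counts from the table and the log-odds-ratio variance formula.

```python
import math

# Rows: smoking + / -; columns: lung cancer + / - (counts from the slide).
a, b = 1301, 1205
c, d = 56, 152

odds_ratio = (a * d) / (b * c)
var_log_or = 1 / a + 1 / b + 1 / c + 1 / d
half_width = 1.96 * math.sqrt(var_log_or)

# The interval is built on the log scale and then exponentiated.
ci_low = math.exp(math.log(odds_ratio) - half_width)
ci_high = math.exp(math.log(odds_ratio) + half_width)

print(round(odds_ratio, 2), round(var_log_or, 3))
print(round(ci_low, 2), round(ci_high, 2))
```

Because the interval is symmetric on the log scale, it is not symmetric around the odds ratio itself.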

severe chest pain lasting 30 min or more * RACE Crosstabulation

                                                     RACE
                                           Chinese   Malay    Indian   Total
severe chest pain    yes   Count           47        14       22       83
lasting 30 min             % within RACE   6.2%      9.0%     13.5%    7.7%
or more              no    Count           707       142      141      990
                           % within RACE   93.8%     91.0%    86.5%    92.3%
Total                      Count           754       156      163      1073
                           % within RACE   100.0%    100.0%   100.0%   100.0%

Chi-Square Tests

                               Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square             10.300(a)   2    .006
Likelihood Ratio               9.170       2    .010
Linear-by-Linear Association   10.156      1    .001
N of Valid Cases               1073

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.07.

Beyond 2 x 2 tables
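
For a table larger than 2 × 2 the calculation is the same, only with more cells and more degrees of freedom. The sketch below recomputes the Pearson chi-square, the minimum expected count and the p-value for the chest pain by race table; the variable names are illustrative, and the closed-form p-value used here applies to df = 2 only.

```python
import math

# Columns: Chinese, Malay, Indian (counts from the crosstabulation).
observed = [
    [47, 14, 22],     # severe chest pain: yes
    [707, 142, 141],  # severe chest pain: no
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 the chi-square upper tail has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2)
min_expected = min(min(row) for row in expected)

print(round(statistic, 3), df, round(p_value, 3), round(min_expected, 2))
```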

Nominal or ordinal

For categorical variables with two possible outcomes:
- Does not matter whether the variable is nominal or ordinal

For categorical variables with more than 2 outcomes:
- Important to note whether the variable is nominal or ordinal
- Test to use is very different, and thus conclusion reached can be very different.

Example: Consider the same dataset on severe chest pain, suppose we have the smoking status of every individual, classified into:
- Non-smoker
- Daily smoker
- Excessive smoker

Smoking intensity

severe chest pain lasting 30 min or more * smoking status Crosstabulation

                                                              smoking status
                                                     non-smoker   daily smoker   Ex-smoker   Total
severe chest pain    yes   Count                     53           19             9           81
lasting 30 min             % within smoking status   6.7%         9.8%           13.2%       7.7%
or more              no    Count                     736          174            59          969
                           % within smoking status   93.3%        90.2%          86.8%       92.3%
Total                      Count                     789          193            68          1050
                           % within smoking status   100.0%       100.0%         100.0%      100.0%

Chi-Square Tests

                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             5.243(a)   2    .073
Likelihood Ratio               4.724      2    .094
Linear-by-Linear Association   5.236      1    .022
N of Valid Cases               1050

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.25.

OR(smoker) = 1.52 (0.88, 2.63), p = 0.180

OR(ex-smoker) = 2.11 (1.00, 4.51), p = 0.081

with non-smoker as reference category.

Chi-square test for trend

Linear-by-linear association

Adopts a correlational approach by calculating the Pearson correlation coefficient between the rows and the columns, allowing for ordinal outcomes in either.

Recode rows as: yes = 0, no = 1.

Recode columns as: non-smoker = 0, daily smoker = 1, ex-smoker = 2

Linear-by-linear association

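Each cell contributes as many (row score, column score) pairs as its count, e.g. 53 observations at (0, 0) and 19 observations at (0, 1).

Pearson Correlation = -0.0706

Consider the test statistic:

$$T = (N - 1)\, r^2 \sim \chi^2(1)$$

$$T = (1050 - 1)(-0.0706)^2 = 5.2356$$

(computed with the unrounded value of r), which matches the Linear-by-Linear Association value of 5.236 in the Chi-Square Tests table.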
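
The recoding and the resulting test statistic can be reproduced directly from the crosstabulation. The following is a minimal sketch in Python with illustrative variable names; it treats every cell as a stack of identical (row score, column score) pairs and computes the ordinary Pearson correlation from the weighted sums.

```python
import math

row_scores = [0, 1]       # yes = 0, no = 1
col_scores = [0, 1, 2]    # non-smoker = 0, daily smoker = 1, ex-smoker = 2
counts = [
    [53, 19, 9],      # yes
    [736, 174, 59],   # no
]

n = sum(sum(row) for row in counts)

# Weighted sums over the table: each cell contributes `count` identical
# (row score, column score) pairs.
sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
for x, row in zip(row_scores, counts):
    for y, count in zip(col_scores, row):
        sum_x += count * x
        sum_y += count * y
        sum_xx += count * x * x
        sum_yy += count * y * y
        sum_xy += count * x * y

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
t_stat = (n - 1) * r ** 2
p_value = math.erfc(math.sqrt(t_stat / 2))  # chi-square with 1 df

print(round(r, 4), round(t_stat, 4), round(p_value, 3))
```

The sign of r depends on the coding direction of the rows and columns, but the test statistic does not, since it uses the square of r.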

Nominal vs. Ordinal

Importance of recognising the kind of variables we have in order to identify the right test!

Chi-Square Tests (smoking status example)

                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             5.243(a)   2    .073
Likelihood Ratio               4.724      2    .094
Linear-by-Linear Association   5.236      1    .022
N of Valid Cases               1050

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.25.

• Summarise data using cross-tabulation tables, with percentages

• Recognise whether any of the variables are ordinal

• Perform a chi-square test of independence to test for association between the two categorical variables, or the linear-by-linear test if at least one of the two variables is ordinal

• Check the validity of the assumption on the sample size

• Quantify any significant association using odds ratios

• Always report odds ratios with corresponding 95% confidence interval

Procedure for Categorical Data Analysis

Categorical Data Analysis in SPSS

Example: Let’s consider the lung cancer and smoking example:

1. Establish the relationship between the onset of lung cancer and smoking status. Quantify this relationship if it is statistically significant.

                 Ca (+ve)   Ca (-ve)
Smoking (+)      1,301      1,205
Smoking (-)      56         152

Data entry

Slightly counter-intuitively, the event of interest and the outcome of interest should each be coded as 0, and the baseline reference outcome/event coded as 1.

Define what 0 and 1 correspond to:

The 0s and 1s are converted to the labels you specified under “Values”.

Percentages are much easier and more meaningful to interpret than absolute numbers!

Highly significant, P < 0.001

Odds ratio of getting lung cancer with corresponding 95% CI, with non-smoker as baseline

Relative risk of getting lung cancer with corresponding 95% CI, with non-smoker as baseline

• understand the use of a chi-square test for testing independence between two categorical outcomes

• understand the assumptions on sample sizes for the use of a chi-square test

• know how to quantify the association using odds ratio/relative risk, with corresponding 95% confidence intervals

• differentiate between the tests to be used for nominal categorical and ordinal categorical variables.

• perform the appropriate analyses in SPSS and RExcel

Students should be able to