Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Hong Tran, April 21, 2015

Laboratory for Interdisciplinary Statistical Analysis

Collaboration:

Visit our website to request personalized statistical advice and assistance with:

Designing Experiments • Analyzing Data • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, Minitab...)

LISA statistical collaborators aim to explain concepts in ways useful for your research.

Great advice right now: Meet with LISA before collecting your data.

All services are FREE for VT researchers. We assist with research—not class projects or homework.

LISA helps VT researchers benefit from the use of Statistics

www.lisa.stat.vt.edu

LISA also offers:

Educational Short Courses: Designed to help graduate students apply statistics in their research

Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions <30 mins. See our website for additional times and locations.

Outline

1. What is CDA?

2. Contingency Table

3. Measures of Association

4. Test of Independence

5. What is GLM? When should we use it?

6. How to evaluate the GLM models?

7. Logistic Regression

8. Poisson Regression

What is CDA?

Dependent Variable (Y)    Independent Variables (X)    Model

Continuous (Normal)       Continuous                   Linear Regression

Continuous                Categorical                  ANOVA

Continuous                Mixed                        ANCOVA

Categorical               Categorical                  CDA

Contingency Table

I rows for categories in X

J columns for categories in Y

Each cell corresponds to one possible outcome (one combination of an X category and a Y category) and holds its count

Example 1 of Contingency Table

Studies the relationship between smoking and epidermoid/undifferentiated pulmonary carcinoma (cancer)

Cohort study conducted

2x2 contingency table

Does smoking increase the risk of having epidermoid/undifferentiated pulmonary carcinoma?

Generating Contingency Table in R

Input the 2×2 table in R as a 2×2 matrix

Change the matrix to table using the function as.table(), because some functions are happier with tables than matrices
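
The two steps above can be sketched as follows. The counts are an assumption taken from the Example 1 table later in this deck (each proportion there is a count out of 313); the object names smoke_mat and smoke_tab and the dimension labels are illustrative choices, not the presenter's.

```r
# Step 1: enter the 2x2 table as a matrix (counts assumed from Example 1).
smoke_mat <- matrix(c(18, 13,
                      46, 236),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                    Status  = c("Cases", "Control")))

# Step 2: convert the matrix to a table object.
smoke_tab <- as.table(smoke_mat)

# Show the table with row and column totals, then the cell proportions.
print(addmargins(smoke_tab))
print(prop.table(smoke_tab))
```

addmargins() and prop.table() are base R helpers; they are only used here to display totals and proportions.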

Measures of Association

Continuous Variables: Pearson Correlation Coefficient

Ordinal Variables: Pearson Correlation Coefficient

Nominal Variables: Phi Coefficient and Cramer’s V

Pearson Correlation

Pearson Correlation Example 2

mtcars in R

1974 Motor Trend US magazine

mpg: miles per gallon

wt: weight

drat: rear axle ratio
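
A minimal sketch of how the Pearson correlation for these three mtcars variables might be computed in base R. The choice of cor() and cor.test() is an assumption; the slides do not show which calls were used in the session.

```r
# mtcars ships with base R (datasets package).
data(mtcars)

# Pearson correlation between mpg and wt (method = "pearson" is the default).
r_mpg_wt <- cor(mtcars$mpg, mtcars$wt)
print(r_mpg_wt)

# Correlation matrix for the three variables named on the slide.
cor_mat <- cor(mtcars[, c("mpg", "wt", "drat")])
print(round(cor_mat, 3))

# Test of H0: correlation = 0, with a confidence interval.
ct <- cor.test(mtcars$mpg, mtcars$wt)
print(ct)
```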

Phi Coefficient

The phi coefficient measures the association between two binary variables.

Its value ranges from -1 to +1, where +1/-1 indicates perfect positive/negative association and 0 indicates no association.

The square of the phi coefficient equals the chi-squared statistic for a 2×2 contingency table divided by the sample size n.
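
As an illustration, the phi coefficient can be computed by hand from the four cell counts and checked against the chi-squared statistic. The counts are assumed from the Example 1 table in this deck, and the variable names are hypothetical.

```r
# 2x2 table of counts (assumed from Example 1: smoking vs. carcinoma).
smoke_tab <- as.table(matrix(c(18, 13,
                               46, 236),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                             Status  = c("Cases", "Control"))))

n11 <- as.numeric(smoke_tab[1, 1]); n12 <- as.numeric(smoke_tab[1, 2])
n21 <- as.numeric(smoke_tab[2, 1]); n22 <- as.numeric(smoke_tab[2, 2])

# phi = (n11*n22 - n12*n21) / sqrt(product of the row and column totals)
phi_hat <- (n11 * n22 - n12 * n21) /
  sqrt(prod(rowSums(smoke_tab)) * prod(colSums(smoke_tab)))
print(phi_hat)

# Pearson chi-squared statistic without continuity correction.
chisq_stat <- unname(chisq.test(smoke_tab, correct = FALSE)$statistic)

# phi^2 and chi-squared / n should agree.
print(c(phi_squared = phi_hat^2, chisq_over_n = chisq_stat / sum(smoke_tab)))
```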

Cramer’s V

Cramer’s V measures the association between two nominal variables.

It varies from 0 (no association) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.

Measures of Association

Comments:

1. When the two variables are binary, Cramer’s V equals the absolute value of the Phi Coefficient

2. In R, under library(psych), use function phi() for Phi Coefficient

3. In R, under library(vcd), use function assocstats() for Cramer’s V
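
A hedged sketch of these comments in code. The helper cramers_v() is hypothetical, written in base R so the example runs without extra packages; the calls to psych::phi() and vcd::assocstats() run only if those packages happen to be installed. Counts are assumed from Examples 1 and 3 of this deck.

```r
# Hypothetical helper: Cramer's V = sqrt(X^2 / (n * (min(rows, cols) - 1))).
cramers_v <- function(tab) {
  # Pearson chi-squared without continuity correction; small-count warnings
  # are suppressed because only the statistic is needed here.
  x2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(as.numeric(x2) / (sum(tab) * (min(dim(tab)) - 1)))
}

smoke_tab <- as.table(matrix(c(18, 13, 46, 236), nrow = 2, byrow = TRUE,
                             dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                             Status  = c("Cases", "Control"))))
celexa_tab <- as.table(matrix(c(2, 3, 7, 2, 8, 2), nrow = 2, byrow = TRUE,
                              dimnames = list(Treatment = c("Celexa", "Placebo"),
                                              Outcome   = c("Worse", "Same", "Better"))))

v_smoke  <- cramers_v(smoke_tab)   # 2x2 table: equals |phi|
v_celexa <- cramers_v(celexa_tab)  # 2x3 table
print(c(smoke = v_smoke, celexa = v_celexa))

# Package versions named on the slide, run only when available.
if (requireNamespace("psych", quietly = TRUE)) print(psych::phi(smoke_tab))
if (requireNamespace("vcd", quietly = TRUE)) print(vcd::assocstats(celexa_tab))
```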

Test of Independence

Large Sample Size

Chi-square Test

Small Sample Size

Fisher’s Exact Test

Test of Independence (Chi-square Test)

Back to Example 1

              Cases     Control    Total

Smoker        18/313    13/313     31/313

Non-smoker    46/313    236/313    282/313

Total         64/313    249/313    1
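
The formula on the chi-square slide did not survive extraction, so the following is only a sketch using the standard Pearson statistic, the sum over cells of (observed - expected)^2 / expected, applied to the counts behind the table above. Variable names are illustrative.

```r
# Observed counts behind the proportions above (each is a count out of 313).
observed <- matrix(c(18, 13,
                     46, 236),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                   Status  = c("Cases", "Control")))
n <- sum(observed)

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 2))

# Pearson chi-squared statistic and its degrees of freedom (I - 1)(J - 1).
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution.
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
print(c(statistic = x2, df = df, p_value = p_value))
```

All expected counts here exceed 5, which is the condition the deck gives for using the chi-square test rather than Fisher's exact test.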

Test of Independence (Fisher’s Exact Test)

When any of the expected counts falls below 5, the Chi-square test is not appropriate. Instead, we use Fisher’s Exact Test.

Example 3: The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping.

           Worse    Same    Better

Celexa     2        3       7

Placebo    2        8       2

Test of Independence

Chi-Square Test

Use R function chisq.test()

Fisher’s Exact Test

Use R function fisher.test()
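
A short sketch of both functions applied to the two examples in this deck. The table objects and their labels are assumptions built from the counts shown in Examples 1 and 3.

```r
# Example 1 (large counts): chi-square test.
smoke_tab <- as.table(matrix(c(18, 13, 46, 236), nrow = 2, byrow = TRUE,
                             dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                             Status  = c("Cases", "Control"))))
chi_res <- chisq.test(smoke_tab)  # applies Yates' correction by default for 2x2
print(chi_res)
print(chi_res$expected)           # check that expected counts are at least 5

# Example 3 (small counts): Fisher's exact test.
celexa_tab <- as.table(matrix(c(2, 3, 7, 2, 8, 2), nrow = 2, byrow = TRUE,
                              dimnames = list(Treatment = c("Celexa", "Placebo"),
                                              Outcome   = c("Worse", "Same", "Better"))))
# Expected counts are 2, 5.5 and 4.5 in each row, so some fall below 5.
celexa_expected <- suppressWarnings(chisq.test(celexa_tab))$expected
print(celexa_expected)

fisher_res <- fisher.test(celexa_tab)
print(fisher_res)
```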

Generalized Linear Models

Used when the response variable is not continuous or not normally distributed, for example:

Counts: 0, 1, 2, …

Binary: 0 and 1

Comparison

                             General Linear Model                 Generalized Linear Model

Special cases                ANOVA, ANCOVA, MANOVA, MANCOVA,      Linear regression, logistic regression,
                             linear regression, mixed model       Poisson regression

Function in R                lm                                   glm

Typical estimation method    Least squares                        Maximum likelihood

Ordinary Linear Regression

Ordinary Linear Regression (OLR) investigates and models the linear relationship between independent variables and dependent variables that are continuous.

The simplest regression is simple linear regression, which models the linear relationship between a single independent variable and a single dependent variable.

Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$

Assumptions in OLR

The assumptions are:

The true relationship between x and y is linear.

The errors are normally distributed with mean zero and unknown common variance σ².

The errors are uncorrelated.

Possible approaches when the assumption of a normally distributed dependent variable with constant variance is violated:

Data transformations

Weighted least squares

Generalized linear model (GLM)

GLM Model

$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$

The function g is called the link function because it connects the mean μ and the linear predictor in x.

Dependent variable’s distribution must come from the Exponential Family of Distributions

Includes Normal, Bernoulli, Binomial, Poisson, Gamma, etc.

3 Components

Random: Identifies dependent Y and its probability distribution

Systematic: Independent variables in a linear predictor function

Link function: Invertible function g(.) that links the mean of the dependent variable to the systematic component.
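
To make the link-function component concrete, the sketch below inspects R's built-in family objects, which bundle a response distribution with a link and its inverse. The particular families and the value 0.25 are illustrative assumptions.

```r
# Family objects pair a distribution (random component) with a link function.
fam_binom <- binomial(link = "logit")
fam_pois  <- poisson(link = "log")
fam_norm  <- gaussian(link = "identity")

print(c(fam_binom$link, fam_pois$link, fam_norm$link))

# The link maps a mean mu to the linear-predictor scale ...
mu  <- 0.25
eta <- fam_binom$linkfun(mu)      # logit: log(mu / (1 - mu))

# ... and, being invertible, the inverse link maps it back.
mu_back <- fam_binom$linkinv(eta)
print(c(mu = mu, eta = eta, mu_back = mu_back))
```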

Response Distribution

Types of GLMs

GLM and OLR

Ordinary linear regression is a special case of GLM

In OLR, the 3 components for GLM are:

Random: the dependent variable is normally distributed with mean μ and variance σ²

Systematic: Independent variables in a linear predictor function

Link function: Identity link g(μ) = μ

Therefore, the GLM model for ordinary linear regression is

$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
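
One way to see this special case in R is to fit the same model with lm() and with glm() using a Gaussian family and identity link. The mtcars model mpg ~ wt is an arbitrary illustrative choice, not taken from the slides.

```r
# Ordinary linear regression fitted by least squares.
fit_lm <- lm(mpg ~ wt, data = mtcars)

# The same model as a GLM: normal response, identity link.
fit_glm <- glm(mpg ~ wt, family = gaussian(link = "identity"), data = mtcars)

# The two sets of coefficient estimates should agree.
print(rbind(lm = coef(fit_lm), glm = coef(fit_glm)))
```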

Model Evaluation: Deviance

Deviance: measures how closely the predicted values from the fitted model match the actual values from the raw data.

Definition:

Deviance = -2[log-likelihood(proposed model)-log-likelihood(saturated model)]

A saturated model is a model that fits the data perfectly, so its log-likelihood is the maximum. It has as many parameters as observations and hence it provides no simplification at all.

The deviance has a chi-squared asymptotic null distribution.

The degrees of freedom are n-p, where n is the number of observations and p is the number of model parameters.

The smaller the deviance, the better the model.
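
The definition above can be checked numerically. This sketch assumes a Poisson model on the built-in warpbreaks data (an illustrative choice, not from the slides); for a Poisson response the saturated model sets each fitted mean equal to the observed count.

```r
# Proposed model: Poisson regression of break counts on wool type and tension.
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
y <- warpbreaks$breaks

# Log-likelihood of the proposed model, evaluated at its fitted means.
ll_model <- sum(dpois(y, lambda = fitted(fit), log = TRUE))

# Log-likelihood of the saturated model: one parameter per observation (mean = y).
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance = -2 * [logLik(proposed) - logLik(saturated)]
dev_by_hand <- -2 * (ll_model - ll_saturated)

print(c(by_hand = dev_by_hand, from_glm = deviance(fit)))
print(df.residual(fit))  # n - p
```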

Inference in GLM

Goodness of Fit test

─ The null hypothesis is that the model is a good alternative to the saturated model.

─ Deviance is the Likelihood Ratio Statistic (LRS).

Likelihood Ratio test

─ Allows for the comparison of one model to another by looking at the difference in deviance between the two models.

─ Null Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.

─ Alternative Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.

─ The LRS follows a chi-square distribution, with degrees of freedom equal to the difference in the number of parameters.

─ Simpler models have larger deviance.
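
Both tests can be sketched with two nested Poisson models. The warpbreaks data and the model names fit_big (Model 1) and fit_small (Model 2) are assumptions made for illustration.

```r
# Model 2 (simpler) and Model 1 (adds the tension predictor).
fit_small <- glm(breaks ~ wool, family = poisson, data = warpbreaks)
fit_big   <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Goodness of fit: compare the residual deviance to a chi-square on n - p df.
gof_p <- pchisq(deviance(fit_big), df = df.residual(fit_big), lower.tail = FALSE)
print(gof_p)  # a small p-value suggests lack of fit relative to the saturated model

# Likelihood ratio test: difference in deviance between the two models.
lrs    <- deviance(fit_small) - deviance(fit_big)
lrs_df <- df.residual(fit_small) - df.residual(fit_big)
lrt_p  <- pchisq(lrs, df = lrs_df, lower.tail = FALSE)
print(c(LRS = lrs, df = lrs_df, p_value = lrt_p))

# The same comparison using anova().
lrt_table <- anova(fit_small, fit_big, test = "Chisq")
print(lrt_table)
```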

Model Comparison in GLM

Two additional measures for model comparison are:

─ Akaike Information Criterion (AIC)

• Penalizes model for having many parameters

• AIC = -2 logLikelihood + 2*p, where p is the number of parameters in the model

• The smaller the AIC, the better the model

─ Bayesian Information Criterion (BIC)

• BIC = -2 logLikelihood + ln(n)*p, where p is the number of parameters in the model and n is the number of observations

• Usually stronger penalization for additional parameters than AIC

• The smaller the BIC, the better the model
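
A small check of both formulas against R's built-in AIC() and BIC(). The Poisson model on warpbreaks is an illustrative assumption; for a Poisson GLM the parameter count p is simply the number of regression coefficients.

```r
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

ll <- as.numeric(logLik(fit))  # maximized log-likelihood
p  <- length(coef(fit))        # number of parameters
n  <- nobs(fit)                # number of observations

aic_by_hand <- -2 * ll + 2 * p
bic_by_hand <- -2 * ll + log(n) * p

print(c(AIC_by_hand = aic_by_hand, AIC = AIC(fit)))
print(c(BIC_by_hand = bic_by_hand, BIC = BIC(fit)))
```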

Summary

Setup of GLM

Inference in GLM

Deviance and Likelihood Ratio Test

─ Test goodness of fit for the proposed GLM model

─ Test the significance of a predictor variable or set of predictor variables in the model

Model Comparison in GLM

─ AIC

─ BIC

Logistic Regression

Logistic regression is a regression technique for predicting the outcome of a binary dependent variable.

Example: y = 1 (Success), y = 0 (Failure)

Random Component: the dependent variable follows a Bernoulli distribution

─ Probability of Success: p

─ Probability of Failure: 1-p

─ The probability of obtaining y=1 or y=0 is given by the Bernoulli distribution: $P(Y = y) = p^{y} (1 - p)^{1 - y}, \quad y \in \{0, 1\}$

─ Mean(Y): μ = p

Steps for Logistic Regression in R

1. Create a single vector of 0’s and 1’s for the response variable.

2. Use the function glm() with family=binomial to fit the model.

3. Test for goodness of fit and significance of predictors.

4. Interpretation
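
The four steps might look like this on a built-in data set. The choice of mtcars, with am (transmission, coded 0/1) as the response and wt as the predictor, is an assumption for illustration and is not the example used in the session.

```r
# Step 1: the response is a single vector of 0's and 1's.
y <- mtcars$am
print(table(y))

# Step 2: fit the logistic regression with glm() and family = binomial.
fit_logit <- glm(am ~ wt, family = binomial, data = mtcars)
print(summary(fit_logit))

# Step 3: test the significance of the predictor with a likelihood ratio test
# against the intercept-only model.
fit_null <- glm(am ~ 1, family = binomial, data = mtcars)
lrt <- anova(fit_null, fit_logit, test = "Chisq")
print(lrt)

# Step 4: interpretation. exp(coefficient) is the multiplicative change in the
# odds of y = 1 for a one-unit increase in the predictor.
odds_ratio <- exp(coef(fit_logit))
print(odds_ratio)

# Predicted probability of y = 1 for a car with wt = 3.
p_hat <- predict(fit_logit, newdata = data.frame(wt = 3), type = "response")
print(p_hat)
```

With ungrouped 0/1 data the residual deviance is not a reliable goodness-of-fit statistic, which is why this sketch tests the predictor with a likelihood ratio test instead.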

Poisson Regression

Poisson regression is a regression technique for predicting the outcome of a count dependent variable.

Dependent variable measures the number of occurrences in a given time frame.

Outcomes equal to 0,1,2,…

Examples:

Number of penalties during a football game.

Number of customers who shop at a grocery store on a given day.

Number of car accidents at an intersection during a period of time.

Steps for Poisson Regression in R

1. Input data where y is a column of counts.

2. Use the function glm() with family=poisson to fit the model.

3. Test for goodness of fit and significance of predictors.
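
A sketch of these steps, assuming the built-in InsectSprays data (insect counts under six sprays) as a stand-in for the session's own example, which is not shown on the slides.

```r
# Step 1: data where the response is a column of counts.
data(InsectSprays)
print(head(InsectSprays))

# Step 2: fit the Poisson regression with glm() and family = poisson.
fit_pois <- glm(count ~ spray, family = poisson, data = InsectSprays)
print(summary(fit_pois))

# Step 3a: goodness of fit, comparing residual deviance to chi-square on n - p df.
gof_p <- pchisq(deviance(fit_pois), df = df.residual(fit_pois), lower.tail = FALSE)
print(gof_p)

# Step 3b: significance of the predictor via a likelihood ratio test.
fit_null <- glm(count ~ 1, family = poisson, data = InsectSprays)
lrt <- anova(fit_null, fit_pois, test = "Chisq")
print(lrt)

# With the log link, exp(coefficient) is a ratio of expected counts
# relative to the baseline spray.
rate_ratio <- exp(coef(fit_pois))
print(rate_ratio)
```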