Logistic and Poisson Regression: Modeling Binary and Count - LISA

March 3, 2009

Logistic and Poisson Regression: Modeling Binary and Count Data

Statistics Workshop Mark Seiss, Dept. of Statistics

Presentation Outline

1. Introduction to Generalized Linear Models

2. Binary Response Data - Logistic Regression Model

3. Count Response Data - Poisson Regression Model

4. Variable Significance – Likelihood Ratio Test

Reference Material

•  Short Course Presentation and Data from Examples
   –  www.lisa.stat.vt.edu/short_courses.php
•  Categorical Data Analysis – Alan Agresti
   –  Examples found with SAS Code at www.stat.ufl.edu/~aa/cda/cda.html
•  UCLA Statistical Consulting Website
   –  www.ats.ucla.edu/stat/
   –  Detailed examples of statistical analysis of data using SAS, SPSS, Stata, R, etc.

Generalized Linear Models

•  Generalized linear models (GLM) extend ordinary regression to non-normal response distributions.

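•  Model: $g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$ for i = 1 to n, where $\mu_i = E(Y_i)$ and $g$ is the link function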

•  Why do we use GLMs?
   –  Linear regression assumes that the response is normally distributed.
   –  GLMs allow for analysis when it is not reasonable to assume the data are normally distributed.

Generalized Linear Models

•  Predictor Variables
   –  Two types: continuous and categorical
•  Continuous Predictor Variables
   –  Examples: time, grade point average, test score, etc.
   –  Coded with one parameter
•  Categorical Predictor Variables
   –  Examples: sex, political affiliation, marital status, etc.
   –  The actual value assigned to a category is not important
   –  Ex) Sex: Male/Female, M/F, 1/2, 0/1, etc.
   –  Coded differently than continuous variables

Generalized Linear Models

•  Categorical Predictor Variables cont.
   –  Consider a categorical predictor variable with L categories
   –  One category is selected as the reference category; the assignment of the reference category is arbitrary
   –  The variable is represented by L-1 dummy variables, which keeps the model identifiable
   –  Two types of coding: dummy and effect
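The two coding schemes can be sketched in a few lines of Python. This is a minimal illustration, not the software's actual implementation: the function name `code_categorical` is hypothetical, and treating the last listed level as the reference category is an assumption (the choice is arbitrary, as noted above).

```python
def code_categorical(value, levels, scheme="dummy"):
    """Return the L-1 coded columns for one observation.

    Assumption for this sketch: the LAST entry of `levels` is the
    reference category.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    *non_reference, reference = levels
    if scheme == "dummy":
        # Reference category -> all zeros; other levels -> a single 1.
        return [1 if value == lvl else 0 for lvl in non_reference]
    if scheme == "effect":
        # Reference category -> all -1; other levels -> a single 1.
        if value == reference:
            return [-1] * len(non_reference)
        return [1 if value == lvl else 0 for lvl in non_reference]
    raise ValueError(f"unknown scheme: {scheme!r}")


levels = ["A", "B", "C"]          # L = 3 categories -> 2 coded columns
for v in levels:
    print(v, code_categorical(v, levels, "dummy"), code_categorical(v, levels, "effect"))
```

With effect coding the two categories of a binary predictor are coded +1 and -1, which is why a factor of 2 appears when comparing the two categories of an effect-coded variable.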

Generalized Linear Models

•  Summary
   –  Generalized Linear Models
   –  Continuous and Categorical Predictor Variables

Generalized Linear Models

•  Questions/Comments

Logistic Regression

•  Consider a binary response variable.
   –  Variable with two outcomes
   –  One outcome is represented by a 1 and the other by a 0
•  Examples:
   –  Does the person have a disease? Yes or No
   –  Who is the person voting for? McCain or Obama
   –  Outcome of a baseball game? Win or loss

Logistic Regression

•  Logistic Regression Example Data Set
   –  Response Variable: Admission to Grad School (Admit)
      0 if admitted, 1 if not admitted
   –  Predictor Variables
      GRE Score (gre): continuous
      University Prestige (topnotch): 1 if prestigious, 0 otherwise
      Grade Point Average (gpa): continuous

Logistic Regression

•  First 10 Observations of the Data Set

ADMIT  GRE  TOPNOTCH  GPA
1      380  0         3.61
0      660  1         3.67
0      800  1         4
0      640  0         3.19
1      520  0         2.93
0      760  0         3
0      560  0         2.98
1      400  0         3.08
0      540  0         3.39
1      700  1         3.92

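Logistic Regression

•  Consider the logistic regression model
   $\operatorname{logit}(\pi(X_i)) = \log\left(\frac{\pi(X_i)}{1 - \pi(X_i)}\right) = \beta_0 + \beta_1 X_i$
   where $\pi(X_i)$ is the probability of the modeled outcome given $X_i$
•  GLM with binomial random component and logit link g(µ) = logit(µ)
•  Range of values for π(Xi) is 0 to 1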
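To see why the fitted probability stays between 0 and 1, the logit link can be inverted. The sketch below uses only the standard library; the function names `inv_logit` and `logit` are illustrative choices, not part of any package referenced in this presentation.

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map a linear predictor eta = b0 + b1*x to a probability.

    Written in two branches so that exp() never overflows for
    large positive or negative eta.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Any real-valued linear predictor is mapped into the interval (0, 1).
for eta in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(eta, inv_logit(eta))
```

The linear predictor can be any real number, but the probability it implies never leaves the unit interval, which is what makes the logit link suitable for a binary response.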

Logistic Regression

•  Interpretation of Coefficient β – Odds Ratio
   –  The odds ratio is a statistic that compares the odds of one event with the odds of another event.
   –  Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is
      $\text{OR} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$
   –  Values of the odds ratio range from 0 to infinity
   –  A value between 0 and 1 indicates the odds of Event 2 are greater
   –  A value between 1 and infinity indicates the odds of Event 1 are greater
   –  A value equal to 1 indicates the events are equally likely
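A small sketch of the odds ratio calculation, using hypothetical helper names `odds` and `odds_ratio` and made-up probabilities chosen only to illustrate the three cases listed above:

```python
def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p1, p2):
    """Odds ratio of Event 1 (probability p1) to Event 2 (probability p2)."""
    return odds(p1) / odds(p2)


# Illustrative probabilities only (not taken from the example data).
print(odds_ratio(0.75, 0.50))   # greater than 1: odds of Event 1 are greater
print(odds_ratio(0.50, 0.75))   # between 0 and 1: odds of Event 2 are greater
print(odds_ratio(0.30, 0.30))   # equal to 1: events are equally likely
```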

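Logistic Regression

•  Interpretation of Coefficient β – Odds Ratio cont.
   –  From our logistic regression model with a single continuous variable, the ratio of the odds of Y=0 at X+1 to the odds at X is
      $\frac{\text{odds}(X+1)}{\text{odds}(X)} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$
   –  From our logistic regression model with a single two-category variable with effect coding (+1 for one category, -1 for the other), the ratio of the odds of Y=0 from one category to the other is
      $\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 - \beta_1}} = e^{2\beta_1}$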

Logistic Regression

•  Single Continuous Predictor Variable – GPA

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model       -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference  6.50444839      13.0089        1   0.0003
Full        243.48381
Reduced     249.988259

Goodness Of Fit Statistic  ChiSquare  DF   Prob>ChiSq
Pearson                    401.1706   398  0.4460
Deviance                   486.9676   398  0.0015

Logistic Regression

•  Single Continuous Predictor Variable – GPA cont.

Effect Tests
Source  DF  L-R ChiSquare  Prob>ChiSq
GPA     1   13.008897      0.0003

Parameter Estimates
Term       Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
Intercept  -4.357587  1.0353175  19.117873      <.0001      -6.433355  -2.367383
GPA        1.0511087  0.2988695  13.008897      0.0003      0.4742176  1.6479411

Interpretation of the Parameter Estimate: Exp{1.0511087} = 2.86 = odds ratio between the odds at x+1 and the odds at x, for all x

The ratio of the odds of being admitted between a person with a 3.0 GPA and a person with a 2.0 GPA is equal to 2.86; equivalently, the odds for the person with the 3.0 GPA are 2.86 times the odds for the person with the 2.0 GPA.
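The interpretation can be checked numerically from the two parameter estimates. The sketch below copies the Intercept and GPA estimates from the output; the function names are illustrative. Note that the model is for P(Admit=0) and, in this data set, Admit=0 means "admitted".

```python
import math

# Estimates copied from the Parameter Estimates table above.
INTERCEPT = -4.357587
BETA_GPA = 1.0511087


def odds_admitted(gpa):
    """Fitted odds of admission at a given GPA: exp(b0 + b1 * gpa)."""
    return math.exp(INTERCEPT + BETA_GPA * gpa)


def prob_admitted(gpa):
    """Fitted probability of admission, recovered from the odds."""
    o = odds_admitted(gpa)
    return o / (1.0 + o)


# Odds ratio for a one-unit increase in GPA (3.0 versus 2.0).
ratio = odds_admitted(3.0) / odds_admitted(2.0)
print(round(ratio, 2))  # 2.86, matching Exp{1.0511087}

# The same odds ratio holds for any one-unit step, e.g. 3.5 versus 2.5.
print(round(odds_admitted(3.5) / odds_admitted(2.5), 2))

# The fitted probabilities themselves do not change by a constant factor.
print(prob_admitted(2.0), prob_admitted(3.0))
```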

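Logistic Regression

•  Single Categorical Predictor Variable – Top Notch

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model       -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference  3.53984692      7.0797         1   0.0078
Full        246.448412
Reduced     249.988259

Goodness Of Fit Statistic  ChiSquare  DF   Prob>ChiSq
Pearson                    400.0000   398  0.4624
Deviance                   492.8968   398  0.0008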

Logistic Regression

•  Single Categorical Predictor Variable – Top Notch cont.

Effect Tests
Source    DF  L-R ChiSquare  Prob>ChiSq
TOPNOTCH  1   7.0796939      0.0078

Parameter Estimates
Term         Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
Intercept    -0.525855  0.138217   14.446085      0.0001      -0.799265  -0.255667
TOPNOTCH[0]  -0.371705  0.138217   7.0796938      0.0078      -0.642635  -0.099011

Interpretation of the Parameter Estimate: Exp{2*(-0.371705)} = 0.4755 = odds ratio between the odds of admittance for a student from a less prestigious university and the odds of admittance for a student from a more prestigious university.

The odds of being admitted from a less prestigious university are 0.48 times the odds of being admitted from a more prestigious university.
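The factor of 2 in the exponent comes from effect coding. The sketch below assumes the coding implied by the term label TOPNOTCH[0]: +1 for topnotch = 0 and -1 for topnotch = 1. The estimates are copied from the output; the function name is illustrative.

```python
import math

# Estimates copied from the Parameter Estimates table above.
INTERCEPT = -0.525855
BETA_TOPNOTCH_0 = -0.371705


def odds_admitted(topnotch):
    """Fitted odds of admission under the assumed effect coding:
    +1 when topnotch == 0 (less prestigious), -1 when topnotch == 1."""
    code = 1 if topnotch == 0 else -1
    return math.exp(INTERCEPT + BETA_TOPNOTCH_0 * code)


# exp(b0 + b) / exp(b0 - b) = exp(2b): the intercept cancels.
ratio = odds_admitted(0) / odds_admitted(1)
print(round(ratio, 4))  # 0.4755, matching Exp{2*(-0.371705)}
```

Under dummy coding (0/1) the same comparison would be Exp{β} with no factor of 2, so the coding scheme has to be known before a coefficient is read as an odds ratio.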

Logistic Regression

•  Summary
   –  Introduction to the Logistic Regression Model
   –  Interpretation of the Parameter Estimates β – Odds Ratio

Logistic Regression

•  Questions/Comments

Poisson Regression

•  Consider a count response variable.
   –  The response variable is the number of occurrences in a given time frame.
   –  Outcomes are equal to 0, 1, 2, ….
•  Examples:
   –  Number of penalties during a football game
   –  Number of customers who shop at a store on a given day
   –  Number of car accidents at an intersection

Poisson Regression

•  Poisson Regression Example Data Set
   –  Response Variable: Number of Days Absent (integer)
   –  Predictor Variables
      Gender: 1 if Female, 2 if Male
      Ethnicity: 6 ethnic categories
      School: 1 if School 1, 2 if School 2
      Math Test Score: continuous
      Language Test Score: continuous
      Bilingual Status: 4 bilingual categories

Poisson Regression

•  First 10 Observations from the Poisson Regression Example Data Set

Obs  GENDER  Ethnicity  School  Math Score  Lang. Score  Bilingual.status  Days Absent
1    2       4          1       56.988830   42.45086     2                 4
2    2       4          1       37.094160   46.82059     2                 4
3    1       4          1       32.275460   43.56657     2                 2
4    1       4          1       29.056720   43.56657     2                 3
5    1       4          1       6.748048    27.24847     3                 3
6    1       4          1       61.654280   48.41482     0                 13
7    1       4          1       56.988830   40.73543     2                 11
8    2       4          1       10.390490   15.35938     2                 7
9    2       4          1       50.527950   52.11514     2                 10
10   2       6          1       49.472050   42.45086     0                 9

Poisson Regression

•  Consider the Poisson log-linear model
   $\log(\mu_i) = \beta_0 + \beta_1 X_i$, where $\mu_i$ is the expected count given $X_i$
•  GLM with Poisson random component and log link g(µ) = log(µ)
•  Predicted response values fall between 0 and +∞

Poisson Regression

•  Interpretation of Coefficient β
   –  From our Poisson regression model with a single continuous variable, the relationship between the predicted response at value x and value x+1 is
      $\mu(x+1) = e^{\beta_1}\,\mu(x)$
   –  From our Poisson regression model with a single two-category variable with effect coding, the relationship between the predicted response from one category to the other is
      $\mu_{\text{category 1}} = e^{2\beta_1}\,\mu_{\text{category 2}}$
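Both relationships follow directly from the log link. A short derivation, assuming effect coding assigns +1 to one category and -1 to the other (the symbols $\mu(+1)$ and $\mu(-1)$ for the two category means are notation introduced here):

```latex
\begin{align*}
\log \mu(x) &= \beta_0 + \beta_1 x
  \quad\Longrightarrow\quad \mu(x) = e^{\beta_0 + \beta_1 x} \\[4pt]
\frac{\mu(x+1)}{\mu(x)}
  &= \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
   = e^{\beta_1} \\[4pt]
\frac{\mu(+1)}{\mu(-1)}
  &= \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 - \beta_1}}
   = e^{2\beta_1}
\end{align*}
```

In both cases the intercept cancels, so the coefficient acts as a multiplicative effect on the expected count rather than an additive one.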

Poisson Regression

•  Single Continuous Predictor Variable – Math Score

Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316

Whole Model Test
Model       -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference  39.619507       79.2390        1   <.0001
Full        1595.98854
Reduced     1635.60805

Goodness Of Fit Statistic  ChiSquare  DF   Prob>ChiSq
Pearson                    3080.403   314  0.0000
Deviance                   2330.581   314  <.0001

Poisson Regression

•  Single Continuous Predictor Variable – Math Score cont.

Effect Tests
Source         DF  L-R ChiSquare  Prob>ChiSq
ctbs math nce  1   79.239014      <.0001

Parameter Estimates
Term           Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
Intercept      2.3020999  0.0627765  1044.4013      <.0001      2.1780081  2.424086
ctbs math nce  -0.011568  0.0012941  79.239014      <.0001      -0.014101  -0.009029

Interpretation of the parameter estimate:

Exp{-0.011568} = 0.9885 (about 0.99) = multiplicative effect on the expected number of days absent for an increase of 1 in the Math Score

Fabricated Example – If a student is expected to miss 5 days with a math score of 50, then another student with a math score of 51 is expected to miss 5*0.9885 = 4.94 days
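The multiplicative effect can be reproduced from the estimates in the output. In this sketch the coefficients are copied from the Parameter Estimates table; the function name `expected_days_absent` is illustrative.

```python
import math

# Estimates copied from the Parameter Estimates table above.
INTERCEPT = 2.3020999
BETA_MATH = -0.011568


def expected_days_absent(math_score):
    """Fitted mean under the log link: exp(b0 + b1 * math_score)."""
    return math.exp(INTERCEPT + BETA_MATH * math_score)


# Multiplicative effect of a one-point increase in the math score.
multiplier = math.exp(BETA_MATH)
print(round(multiplier, 4))  # 0.9885

# The ratio of fitted means one point apart equals that multiplier,
# whatever the starting score.
print(expected_days_absent(51) / expected_days_absent(50))
print(expected_days_absent(21) / expected_days_absent(20))
```

Because the effect is multiplicative, a 10-point increase multiplies the expected count by Exp{10*β}, not by 10 times the one-point effect.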

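Poisson Regression

•  Single Categorical Predictor Variable – Gender

Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316

Whole Model Test
Model       -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference  22.6810514      45.3621        1   <.0001
Full        1612.927
Reduced     1635.60805

Goodness Of Fit Statistic  ChiSquare  DF   Prob>ChiSq
Pearson                    2877.292   314  0.0000
Deviance                   2364.458   314  <.0001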

Poisson Regression

•  Single Categorical Predictor Variable – Gender cont.

Effect Tests
Source  DF  L-R ChiSquare  Prob>ChiSq
GENDER  1   45.362103      <.0001

Parameter Estimates
Term       Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
Intercept  1.743096   0.023734   3155.5494      0.0000      1.6962023  1.7892445
GENDER[1]  0.1586429  0.023734   45.362103      <.0001      0.1122479  0.2053005

Interpretation of the parameter estimate:

Exp{2*0.1586} = 1.3733 = multiplicative effect on the expected number of days absent of being female rather than male

If a male student is expected to miss X days, then a female student is expected to miss 1.3733*X days.
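The same check for the effect-coded Gender variable, assuming the coding implied by the term label GENDER[1]: +1 for GENDER = 1 (female) and -1 for GENDER = 2 (male). Estimates are copied from the output; the function name is illustrative.

```python
import math

# Estimates copied from the Parameter Estimates table above.
INTERCEPT = 1.743096
BETA_GENDER_1 = 0.1586429


def expected_days_absent(gender):
    """Fitted mean under the assumed effect coding:
    +1 when gender == 1 (female), -1 when gender == 2 (male)."""
    code = 1 if gender == 1 else -1
    return math.exp(INTERCEPT + BETA_GENDER_1 * code)


female = expected_days_absent(1)
male = expected_days_absent(2)

# Ratio of fitted means equals exp(2 * beta); the intercept cancels.
print(female / male)
print(math.exp(2 * BETA_GENDER_1))
```

The ratio comes out at about 1.373; the last digit can differ slightly from the 1.3733 quoted above because that figure uses the estimate rounded to 0.1586.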

Poisson Regression

•  Summary
   –  Introduction to the Poisson Regression Model
   –  Interpretation of β

Likelihood Ratio Test

•  Deviance
   –  Let L(µ|y) = maximum of the log likelihood for the model
      and L(y|y) = maximum of the log likelihood for the saturated model
   –  Deviance = D(y|µ) = -2 [L(µ|y) - L(y|y)]
   –  Tests the null hypothesis that the model fits as well as the saturated model, which reproduces the observed values exactly
   –  Under that null hypothesis, the deviance has an asymptotic chi-squared distribution with N – p degrees of freedom, where N is the number of observations and p is the number of parameters in the model.
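As a concrete instance of the definition, the deviance of a Poisson model can be computed from the two log likelihoods. This is a sketch under stated assumptions: the response values are the first five Days Absent values from the example data, the fitted values come from an intercept-only model (every fitted value equals the sample mean), and all function names are illustrative.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log likelihood: sum of y*log(mu) - mu - log(y!)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))


def saturated_loglik(y):
    """Log likelihood of the saturated model, where each fitted value
    equals the observation itself (y*log(y) is taken as 0 when y == 0)."""
    return sum((yi * math.log(yi) if yi > 0 else 0.0) - yi - math.lgamma(yi + 1)
               for yi in y)


def poisson_deviance(y, mu):
    """Deviance = -2 * [L(mu|y) - L(y|y)]."""
    return -2.0 * (poisson_loglik(y, mu) - saturated_loglik(y))


y = [4, 4, 2, 3, 3]                       # first five Days Absent values
mu = [sum(y) / len(y)] * len(y)           # intercept-only fit: the sample mean
print(poisson_deviance(y, mu))            # non-negative by construction
print(poisson_deviance(y, [float(v) for v in y]))  # saturated fit: deviance 0
```

The log(y!) terms cancel in the difference, so the deviance depends only on how far the fitted values are from the observations.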

Likelihood Ratio Test

•  Nested Models
   –  Model 1: model with p predictor variables {X1, X2, X3, …, Xp} and vector of fitted values µ1
   –  Model 2: model with q < p predictor variables {X1, X2, X3, …, Xq} and vector of fitted values µ2
   –  Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1, i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1
   –  Model 2 is a special case of Model 1: all the coefficients associated with Xq+1, Xq+2, …, Xp are equal to zero

Likelihood Ratio Test

•  Likelihood Ratio Test
   –  Null Hypothesis: There is not a significant difference between the fit of the two models.
   –  Null Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
   –  Alternate Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
   –  Likelihood Ratio Statistic = -2 [L(µ2|y) - L(µ1|y)] = D(y|µ2) - D(y|µ1), the difference of the deviances of the two models
   –  D(y|µ2) ≥ D(y|µ1) always holds, so LRT ≥ 0
   –  Under the null hypothesis, LRT is asymptotically chi-squared distributed with p - q degrees of freedom
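The equality between the likelihood ratio statistic and the difference of deviances can be written out in one step, using the deviance definition D(y|µ) = -2 [L(µ|y) - L(y|y)]:

```latex
\begin{align*}
D(y \mid \mu_2) - D(y \mid \mu_1)
  &= -2\,\bigl[L(\mu_2 \mid y) - L(y \mid y)\bigr]
     + 2\,\bigl[L(\mu_1 \mid y) - L(y \mid y)\bigr] \\
  &= -2\,\bigl[L(\mu_2 \mid y) - L(\mu_1 \mid y)\bigr]
\end{align*}
```

The saturated-model term L(y|y) cancels, so only the two fitted log likelihoods are needed. Because Model 2 is a special case of Model 1, its maximized log likelihood cannot exceed that of Model 1, which is why the statistic is never negative.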

Likelihood Ratio Test

•  Theoretical Example of Likelihood Ratio Test
   –  3 predictor variables: 1 continuous (X1), 1 categorical with 4 categories (X2, X3, X4), 1 categorical with 2 categories (X5)
   –  Model 1: predictor variables {X1, X2, X3, X4, X5}
   –  Model 2: predictor variables {X1, X5}
   –  Null Hypothesis: the variable with 4 categories is not significant to the model (β2 = β3 = β4 = 0)
   –  Alternate Hypothesis: the variable with 4 categories is significant
   –  Likelihood Ratio Statistic = D(y|µ2) - D(y|µ1), the difference of the deviance statistics from the two models
   –  Chi-squared distribution with 5 - 2 = 3 degrees of freedom

Likelihood Ratio Test

•  Likelihood Ratio Test
   –  Consider the model with GPA, GRE, and Top Notch as predictor variables

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model       -LogLikelihood  L-R ChiSquare  DF  Prob>ChiSq
Difference  10.9234504      21.8469        3   <.0001
Full        239.064808
Reduced     249.988259

Goodness Of Fit Statistic  ChiSquare  DF   Prob>ChiSq
Pearson                    396.9196   396  0.4775
Deviance                   478.1296   396  0.0029

Likelihood Ratio Test

•  Variable Selection – Likelihood Ratio Test cont.

Effect Tests
Source    DF  L-R ChiSquare  Prob>ChiSq
TOPNOTCH  1   2.2143635      0.1367
GPA       1   4.2909753      0.0383
GRE       1   5.4555484      0.0195

Parameter Estimates
Term         Estimate   Std Error  L-R ChiSquare  Prob>ChiSq  Lower CL   Upper CL
Intercept    -4.382202  1.1352224  15.917859      <.0001      -6.657167  -2.197805
TOPNOTCH[0]  -0.218612  0.1459266  2.2143635      0.1367      -0.503583  0.070142
GPA          0.6675556  0.3252593  4.2909753      0.0383      0.0356956  1.3133755
GRE          0.0024768  0.0010702  5.4555484      0.0195      0.0003962  0.0046006
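The whole-model tests reported in the outputs can be reproduced from the -LogLikelihood values alone. The sketch below uses three values taken from the outputs in this presentation: 249.988259 (Reduced, intercept only), 243.48381 (Full, GPA only) and 239.064808 (Full, GPA + GRE + Top Notch). The comparison of the GPA-only model against the three-predictor model is an extra illustration that is not shown in the outputs. `chi2_sf` is a hypothetical standard-library helper written for this sketch; in practice a statistics package would supply the chi-squared tail probability.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable
    with a positive integer number of degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += half ** (k - 0.5) / math.gamma(k + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def lr_test(neg_loglik_small, neg_loglik_big, df):
    """LR statistic and p-value for a smaller model nested in a bigger one."""
    stat = 2.0 * (neg_loglik_small - neg_loglik_big)
    return stat, chi2_sf(stat, df)


NULL_MODEL = 249.988259   # intercept only
GPA_ONLY = 243.48381      # GPA
FULL_MODEL = 239.064808   # GPA + GRE + Top Notch

stat_gpa, p_gpa = lr_test(NULL_MODEL, GPA_ONLY, df=1)        # reported: 13.0089, 0.0003
stat_full, p_full = lr_test(NULL_MODEL, FULL_MODEL, df=3)    # reported: 21.8469, <.0001
stat_add, p_add = lr_test(GPA_ONLY, FULL_MODEL, df=2)        # illustration only

print(stat_gpa, p_gpa)
print(stat_full, p_full)
print(stat_add, p_add)
```

The degrees of freedom are the difference in the number of parameters between the two models: 1 and 3 for the whole-model tests, and 2 when GRE and Top Notch are added to the GPA-only model.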

Likelihood Ratio Test

•  Questions/Comments
