March 3, 2009
Logistic and Poisson Regression: Modeling Binary and Count Data
Statistics Workshop Mark Seiss, Dept. of Statistics
Presentation Outline
1. Introduction to Generalized Linear Models
2. Binary Response Data - Logistic Regression Model
3. Count Response Data - Poisson Regression Model
4. Variable Significance – Likelihood Ratio Test
Reference Material
• Short Course Presentation and Data from Examples
  www.lisa.stat.vt.edu/short_courses.php
• Categorical Data Analysis – Alan Agresti
  Examples found with SAS Code at www.stat.ufl.edu/~aa/cda/cda.html
• UCLA Statistical Consulting Website
  www.ats.ucla.edu/stat/
  Detailed examples of statistical analysis of data using SAS, SPSS, Stata, R, etc.
Generalized Linear Models
• Generalized linear models (GLM) extend ordinary regression to non-normal response distributions.
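• Model
    $g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$ for i = 1 to n,
    where $\mu_i = E(Y_i)$ and $g$ is the link function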
• Why do we use GLMs?
  • Linear regression assumes that the response is normally distributed
  • GLMs allow for analysis when it is not reasonable to assume the data are normally distributed.
Generalized Linear Models
• Predictor Variables
  • Two Types: Continuous and Categorical
  • Continuous Predictor Variables
    • Examples – Time, Grade Point Average, Test Score, etc.
    • Coded with one parameter
  • Categorical Predictor Variables
    • Examples – Sex, Political Affiliation, Marital Status, etc.
    • Actual value assigned to a category is not important
    • Ex) Sex - Male/Female, M/F, 1/2, 0/1, etc.
    • Coded differently than continuous variables
Generalized Linear Models
• Categorical Predictor Variables cont.
  • Consider a categorical predictor variable with L categories
  • One category selected as reference category
    • Assignment of Reference Category is arbitrary
  • Variable represented by L-1 dummy variables
    • Model Identifiability
  • Two types of coding – Dummy and Effect
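The two coding schemes named above can be sketched in a few lines. This is a minimal illustration, assuming the usual definitions (dummy coding gives the reference category all zeros; effect coding gives it -1 in every column). The function names and the marital-status levels are hypothetical and do not come from the workshop data.

```python
def dummy_code(levels, reference):
    """Dummy coding: L-1 indicator columns; the reference level is all zeros."""
    others = [lv for lv in levels if lv != reference]
    return {lv: tuple(1 if lv == o else 0 for o in others) for lv in levels}


def effect_code(levels, reference):
    """Effect coding: like dummy coding, but the reference level is -1 in every column."""
    others = [lv for lv in levels if lv != reference]
    codes = {}
    for lv in levels:
        if lv == reference:
            codes[lv] = tuple(-1 for _ in others)
        else:
            codes[lv] = tuple(1 if lv == o else 0 for o in others)
    return codes


# Hypothetical categorical predictor with L = 3 categories -> 2 columns
levels = ["single", "married", "divorced"]
dummy = dummy_code(levels, reference="divorced")
effect = effect_code(levels, reference="divorced")

for lv in levels:
    print(lv, dummy[lv], effect[lv])
```

Either scheme uses L-1 columns, which is what keeps the model identifiable when an intercept is present.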
Generalized Linear Models
• Summary
  • Generalized Linear Models
  • Continuous and Categorical Predictor Variables
Generalized Linear Models
• Questions/Comments
Logistic Regression
• Consider a binary response variable.
  • Variable with two outcomes
  • One outcome represented by a 1 and the other represented by a 0
  • Examples:
    • Does the person have a disease? Yes or No
    • Who is the person voting for? McCain or Obama
    • Outcome of a baseball game? Win or loss
Logistic Regression
• Logistic Regression Example Data Set
  • Response Variable –> Admission to Grad School (Admit)
    • 0 if admitted, 1 if not admitted
  • Predictor Variables
    • GRE Score (gre)
      – Continuous
    • University Prestige (topnotch)
      – 1 if prestigious, 0 otherwise
    • Grade Point Average (gpa)
      – Continuous
Logistic Regression
• First 10 Observations of the Data Set

ADMIT  GRE  TOPNOTCH  GPA
1      380  0         3.61
0      660  1         3.67
0      800  1         4
0      640  0         3.19
1      520  0         2.93
0      760  0         3
0      560  0         2.98
1      400  0         3.08
0      540  0         3.39
1      700  1         3.92
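Logistic Regression
• Consider the logistic regression model
    $\operatorname{logit}(\pi(X_i)) = \log\left(\frac{\pi(X_i)}{1 - \pi(X_i)}\right) = \beta_0 + \beta_1 X_i$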
• GLM with binomial random component and logit link g(µ) = logit(µ)
• Range of values for π(Xi) is 0 to 1
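A short sketch of the logit link and its inverse may help show why π(Xi) always stays between 0 and 1. It assumes only the standard definitions logit(p) = log(p / (1 - p)) and its inverse; the linear-predictor values fed in are arbitrary illustrations, not fitted values.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map any real linear predictor eta back to a probability in (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # avoids overflow for very negative eta
    return z / (1.0 + z)


# Arbitrary linear-predictor values, from strongly negative to strongly positive
etas = (-10.0, -1.0, 0.0, 1.0, 10.0)
probs = [inv_logit(eta) for eta in etas]
for eta, p in zip(etas, probs):
    print(f"eta = {eta:6.1f}  ->  pi = {p:.6f}")
```

However large or small the linear predictor becomes, the implied probability never leaves the unit interval, which is the property the logit link is chosen for.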
Logistic Regression
• Interpretation of Coefficient β – Odds Ratio
  • The odds ratio is a statistic that compares the odds of one event with the odds of another event.
  • Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:
      $\mathrm{OR} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$
  • Values of the odds ratio range from 0 to infinity
    • A value between 0 and 1 indicates the odds of Event 2 are greater
    • A value between 1 and infinity indicates the odds of Event 1 are greater
    • A value equal to 1 indicates the odds of the two events are equal
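The definition above translates directly into code. This is a minimal sketch using the standard definition odds = π / (1 - π); the probabilities passed in are made-up illustrations.

```python
def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def odds_ratio(p1, p2):
    """Odds ratio of Event 1 (probability p1) to Event 2 (probability p2)."""
    return odds(p1) / odds(p2)


# Illustrative probabilities only
or_greater = odds_ratio(0.75, 0.50)  # odds of Event 1 are greater
or_equal = odds_ratio(0.20, 0.20)    # equal odds
or_smaller = odds_ratio(0.25, 0.50)  # odds of Event 2 are greater

print(or_greater, or_equal, or_smaller)
```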
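Logistic Regression
• Interpretation of Coefficient β – Odds Ratio cont.
  • From our logistic regression model with a single continuous variable, the ratio of the odds of Y=0 for X+1 and X is
      $\frac{\mathrm{odds}(X+1)}{\mathrm{odds}(X)} = \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_1}$
  • From our logistic regression model with a single two category variable with effect coding (one category coded +1, the other -1), the ratio of the odds of Y=0 from one category to another is
      $\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 - \beta_1}} = e^{2\beta_1}$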
Logistic Regression • Single Continuous Predictor Variable - GPA
Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400
Whole Model Test
Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq
Difference 6.50444839 13.0089 1 0.0003
Full 243.48381
Reduced 249.988259
Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq
Pearson 401.1706 398 0.4460
Deviance 486.9676 398 0.0015
Logistic Regression • Single Continuous Predictor Variable – GPA cont.
Effect Tests
Source DF L-R ChiSquare Prob>ChiSq
GPA 1 13.008897 0.0003
Parameter Estimates
Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL
Intercept -4.357587 1.0353175 19.117873 <.0001 -6.433355 -2.367383
GPA 1.0511087 0.2988695 13.008897 0.0003 0.4742176 1.6479411
Interpretation of the Parameter Estimate: Exp{1.0511087} = 2.86 = odds ratio between the odds at x+1 and odds at x for all x
The ratio of the odds of being admitted for a person with a 3.0 GPA to the odds for a person with a 2.0 GPA is 2.86; equivalently, the odds for the person with the 3.0 are 2.86 times the odds for the person with the 2.0.
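The interpretation can be checked numerically from the Parameter Estimates table. This is a sketch only: the function names are illustrative, and the coefficients are copied from the output above rather than refit from the data.

```python
import math

# Estimates copied from the GPA-only Parameter Estimates table
INTERCEPT = -4.357587
BETA_GPA = 1.0511087


def odds_admit(gpa):
    """Fitted odds of Admit = 0 at a given GPA."""
    return math.exp(INTERCEPT + BETA_GPA * gpa)


def prob_admit(gpa):
    """Fitted probability of Admit = 0 at a given GPA."""
    o = odds_admit(gpa)
    return o / (1.0 + o)


# Odds ratio for a one-unit increase in GPA (3.0 versus 2.0)
ratio_3_vs_2 = odds_admit(3.0) / odds_admit(2.0)

# Exponentiating the confidence limits of the coefficient gives limits for the odds ratio
or_limits = (math.exp(0.4742176), math.exp(1.6479411))

print(f"odds ratio = {ratio_3_vs_2:.2f}")
print(f"odds-ratio limits = ({or_limits[0]:.2f}, {or_limits[1]:.2f})")
```

The ratio is the same for any pair of GPAs one unit apart, because the intercept and the common GPA term cancel.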
Logistic Regression • Single Categorical Predictor Variable – Top Notch
Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400
Whole Model Test
Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq
Difference 3.53984692 7.0797 1 0.0078
Full 246.448412
Reduced 249.988259
Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq
Pearson 400.0000 398 0.4624
Deviance 492.8968 398 0.0008
Logistic Regression • Single Categorical Predictor Variable – Top Notch cont.
Effect Tests
Source DF L-R ChiSquare Prob>ChiSq
TOPNOTCH 1 7.0796939 0.0078
Parameter Estimates
Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL
Intercept -0.525855 0.138217 14.446085 0.0001 -0.799265 -0.255667
TOPNOTCH[0] -0.371705 0.138217 7.0796938 0.0078 -0.642635 -0.099011
Interpretation of the Parameter Estimate: Exp{2*-.371705} = 0.4755 = odds ratio between the odds of admittance for a student from a less prestigious university and the odds of admittance for a student from a more prestigious university.
The odds of being admitted from a less prestigious university are 0.48 times the odds of being admitted from a more prestigious university.
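The factor of 2 in the exponent comes from effect coding. The sketch below assumes the usual effect-coding convention, in which the TOPNOTCH[0] category enters the linear predictor as +1 and the other category as -1; variable names are illustrative and the estimates are copied from the table above.

```python
import math

# Estimates copied from the Top Notch Parameter Estimates table
INTERCEPT = -0.525855
BETA_TOPNOTCH_0 = -0.371705  # coefficient reported for TOPNOTCH[0]

# Assumed effect coding: TOPNOTCH = 0 -> +1, TOPNOTCH = 1 -> -1
logit_less_prestigious = INTERCEPT + BETA_TOPNOTCH_0
logit_more_prestigious = INTERCEPT - BETA_TOPNOTCH_0

# The intercept cancels in the difference, leaving 2 * beta
odds_ratio = math.exp(logit_less_prestigious - logit_more_prestigious)


def to_probability(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p_less = to_probability(logit_less_prestigious)
p_more = to_probability(logit_more_prestigious)

print(f"odds ratio (less vs. more prestigious) = {odds_ratio:.4f}")
print(f"fitted P(Admit=0): less = {p_less:.3f}, more = {p_more:.3f}")
```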
Logistic Regression
• Summary
  • Introduction to the Logistic Regression Model
  • Interpretation of the Parameter Estimates β – Odds Ratio
Logistic Regression • Questions/Comments
Poisson Regression
• Consider a count response variable.
  • Response variable is the number of occurrences in a given time frame.
  • Outcomes equal to 0, 1, 2, ….
  • Examples:
    • Number of penalties during a football game.
    • Number of customers who shop at a store on a given day.
    • Number of car accidents at an intersection.
Poisson Regression
• Poisson Regression Example Data Set
  • Response Variable –> Number of Days Absent – Integer
  • Predictor Variables
    • Gender – 1 if Female, 2 if Male
    • Ethnicity – 6 Ethnic Categories
    • School – 1 if School 1, 2 if School 2
    • Math Test Score – Continuous
    • Language Test Score – Continuous
    • Bilingual Status – 4 Bilingual Categories
Poisson Regression
• First 10 Observations from the Poisson Regression Example Data Set

Obs  GENDER  Ethnicity  School  Math Score  Lang. Score  Bilingual.status  Days Absent
1    2       4          1       56.988830   42.45086     2                 4
2    2       4          1       37.094160   46.82059     2                 4
3    1       4          1       32.275460   43.56657     2                 2
4    1       4          1       29.056720   43.56657     2                 3
5    1       4          1       6.748048    27.24847     3                 3
6    1       4          1       61.654280   48.41482     0                 13
7    1       4          1       56.988830   40.73543     2                 11
8    2       4          1       10.390490   15.35938     2                 7
9    2       4          1       50.527950   52.11514     2                 10
10   2       6          1       49.472050   42.45086     0                 9
Poisson Regression
• Consider the Poisson log-linear model
    $\log(\mu_i) = \beta_0 + \beta_1 X_i$
• GLM with Poisson random component and log link g(µ) = log(µ)
• Predicted response values fall between 0 and +∞
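As a rough sketch of what the log link does: the fitted mean is the exponential of the linear predictor, so it is positive for any coefficient values. The coefficients below are hypothetical placeholders, and the helper names are illustrative.

```python
import math


def poisson_mean(beta0, beta1, x):
    """Inverse of the log link: mean = exp(linear predictor), always > 0."""
    return math.exp(beta0 + beta1 * x)


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson count with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


# Hypothetical coefficients: even a very negative linear predictor gives a positive mean
mu_small = poisson_mean(beta0=-3.0, beta1=-0.5, x=10.0)
mu_moderate = poisson_mean(beta0=1.0, beta1=0.01, x=10.0)

# Probabilities over the counts 0, 1, 2, ... sum to (numerically) 1
total_probability = sum(poisson_pmf(k, mu_moderate) for k in range(60))

print(mu_small, mu_moderate, total_probability)
```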
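Poisson Regression
• Interpretation of Coefficient β
  • From our Poisson regression model with a single continuous variable, the relationship between the predicted response at value x and value x+1 is
      $\mu(x+1) = e^{\beta_0 + \beta_1 (x+1)} = \mu(x)\, e^{\beta_1}$
  • From our Poisson regression model with a single two category variable with effect coding (one category coded +1, the other -1), the relationship between the predicted response from one category to another is
      $\frac{\mu_{+1}}{\mu_{-1}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 - \beta_1}} = e^{2\beta_1}$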
Poisson Regression • Single Continuous Predictor Variable – Math Score
Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316
Whole Model Test
Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq
Difference 39.619507 79.2390 1 <.0001
Full 1595.98854
Reduced 1635.60805
Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq
Pearson 3080.403 314 0.0000
Deviance 2330.581 314 <.0001
Poisson Regression • Single Continuous Predictor Variable – Math Score cont.
Effect Tests
Source DF L-R ChiSquare Prob>ChiSq
ctbs math nce 1 79.239014 <.0001
Parameter Estimates
Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL
Intercept 2.3020999 0.0627765 1044.4013 <.0001 2.1780081 2.424086
ctbs math nce -0.011568 0.0012941 79.239014 <.0001 -0.014101 -0.009029
Interpretation of the parameter estimate:
Exp{-0.011568} = 0.9885 ≈ 0.99 = multiplicative effect on the expected number of days absent for an increase of 1 in the Math Score
Fabricated Example – If a student is expected to miss 5 days with a math score of 50, then another student with a math score of 51 is expected to miss 5*0.9885 ≈ 4.94 days
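The multiplicative effect can be reproduced from the fitted coefficients. This is a sketch under the assumption that the estimates in the Parameter Estimates table are used as-is; the function name is illustrative.

```python
import math

# Estimates copied from the Math Score Parameter Estimates table
INTERCEPT = 2.3020999
BETA_MATH = -0.011568


def expected_days_absent(math_score):
    """Fitted mean number of days absent at a given math score."""
    return math.exp(INTERCEPT + BETA_MATH * math_score)


rate_ratio = math.exp(BETA_MATH)  # effect of a 1-point increase in math score
mu_50 = expected_days_absent(50.0)
mu_51 = expected_days_absent(51.0)

print(f"rate ratio per point = {rate_ratio:.4f}")
print(f"fitted mean at 50 = {mu_50:.3f}, at 51 = {mu_51:.3f}")
```

Because the effect is multiplicative, a 10-point increase scales the expected count by the rate ratio raised to the 10th power, not by 10 times the one-point change.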
Poisson Regression • Single Categorical Predictor Variable – Gender
Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316
Whole Model Test
Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq
Difference 22.6810514 45.3621 1 <.0001
Full 1612.927
Reduced 1635.60805
Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq
Pearson 2877.292 314 0.0000
Deviance 2364.458 314 <.0001
Poisson Regression • Single Categorical Predictor Variable – Gender cont.
Effect Tests
Source DF L-R ChiSquare Prob>ChiSq
GENDER 1 45.362103 <.0001
Parameter Estimates
Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL
Intercept 1.743096 0.023734 3155.5494 0.0000 1.6962023 1.7892445
GENDER[1] 0.1586429 0.023734 45.362103 <.0001 0.1122479 0.2053005
Interpretation of the parameter estimate:
Exp{2*0.1586} = 1.3733 = multiplicative effect on the expected number of days absent of being female rather than male
If a male student is expected to miss X days, then a female student is expected to miss 1.3733*X.
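For the gender model, the fitted group means themselves can be recovered from the two estimates. The sketch assumes effect coding with GENDER[1] (female) entering as +1 and male as -1; names are illustrative and the estimates are copied from the table above.

```python
import math

# Estimates copied from the Gender Parameter Estimates table
INTERCEPT = 1.743096
BETA_GENDER_1 = 0.1586429  # coefficient reported for GENDER[1] (female)

# Assumed effect coding: female -> +1, male -> -1
mu_female = math.exp(INTERCEPT + BETA_GENDER_1)
mu_male = math.exp(INTERCEPT - BETA_GENDER_1)
ratio = mu_female / mu_male

print(f"fitted mean days absent: female = {mu_female:.2f}, male = {mu_male:.2f}")
print(f"female / male ratio = {ratio:.4f}")
```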
Poisson Regression
• Summary
  • Introduction to the Poisson Regression Model
  • Interpretation of β
Likelihood Ratio Test
• Deviance
  • Let L(µ|y) = maximum of the log likelihood for the model
        L(y|y) = maximum of the log likelihood for the saturated model
  • Deviance = D(y|µ) = -2 [L(µ|y) - L(y|y)]
  • Tests the null hypothesis that the model is a good alternative to the saturated model, which reproduces the observed values exactly
  • Deviance has an asymptotic chi-squared distribution with N – p degrees of freedom, where N is the number of observations and p is the number of parameters in the model.
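To make the deviance formula concrete, here is a small sketch for the Poisson case. It assumes the standard Poisson log likelihood and uses the first five Days Absent values from the example data with an intercept-only fit (whose fitted mean is the sample mean); this toy fit is an illustration, not one of the models reported in the slides.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log likelihood of counts y under means mu."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu))


def poisson_deviance(y, mu):
    """Closed form of -2 [L(mu|y) - L(y|y)] for the Poisson distribution."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0  # y*log(y) -> 0 as y -> 0
        total += log_term - (yi - mi)
    return 2.0 * total


y = [4, 4, 2, 3, 3]                   # first five Days Absent values
mu_hat = [sum(y) / len(y)] * len(y)   # intercept-only fit: every mean equals the sample mean

deviance = poisson_deviance(y, mu_hat)
# Same quantity from the definition; the saturated model sets mu_i = y_i
from_definition = -2.0 * (poisson_loglik(y, mu_hat) - poisson_loglik(y, y))

print(deviance, from_definition)
```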
Likelihood Ratio Test
• Nested Models
  • Model 1 - model with p predictor variables {X1, X2, X3,….,Xp} and vector of fitted values µ1
  • Model 2 - model with q<p predictor variables {X1, X2, X3,….,Xq} and vector of fitted values µ2
  • Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1.
    • i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1
  • Model 2 is a special case of Model 1 - all the coefficients associated with Xq+1, Xq+2, Xq+3,….,Xp are equal to zero
Likelihood Ratio Test
• Likelihood Ratio Test
  • Null Hypothesis: There is not a significant difference between the fit of two models.
  • Null Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
  • Alternate Hypothesis for Nested Models: The predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
  • Likelihood Ratio Statistic = -2 [L(µ2|y) - L(µ1|y)] = D(y|µ2) - D(y|µ1)
    • Difference of the deviances of the two models
    • Always D(y|µ2) ≥ D(y|µ1), which implies LRT ≥ 0
    • Under the null hypothesis, LRT is asymptotically distributed Chi-Squared with p-q degrees of freedom
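The statistic can be computed directly from the -LogLikelihood values that the software reports. The sketch below uses the GPA-only logistic output shown earlier (Reduced = 249.988259, Full = 243.48381) and the closed-form upper tail of a chi-squared distribution with 1 degree of freedom; the function names are illustrative.

```python
import math


def lr_statistic(neg_loglik_reduced, neg_loglik_full):
    """-2 [L(reduced) - L(full)], written in terms of negative log likelihoods."""
    return 2.0 * (neg_loglik_reduced - neg_loglik_full)


def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Values copied from the GPA-only Whole Model Test table
stat = lr_statistic(neg_loglik_reduced=249.988259, neg_loglik_full=243.48381)
p_value = chi2_upper_tail_1df(stat)

print(f"L-R ChiSquare = {stat:.4f}, p-value = {p_value:.4f}")
```

The saturated-model term cancels when two deviances are subtracted, which is why the two log likelihoods alone are enough.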
Likelihood Ratio Test
• Theoretical Example of Likelihood Ratio Test
  • 3 predictor variables – 1 Continuous (X1), 1 Categorical with 4 Categories (X2, X3, X4), 1 Categorical with 2 Categories (X5)
  • Model 1 - predictor variables {X1, X2, X3, X4, X5}
  • Model 2 - predictor variables {X1, X5}
  • Null Hypothesis – Variable with 4 categories is not significant to the model (β2 = β3 = β4 = 0)
  • Alternate Hypothesis - Variable with 4 categories is significant
  • Likelihood Ratio Statistic = D(y|µ2) - D(y|µ1)
    • Difference of the deviance statistics from the two models
    • Chi-Squared Distribution with 5-2=3 degrees of freedom
Likelihood Ratio Test
• Likelihood Ratio Test
  • Consider the model with GPA, GRE, and Top Notch as predictor variables
Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400
Whole Model Test
Model -LogLikelihood L-R ChiSquare DF Prob>ChiSq
Difference 10.9234504 21.8469 3 <.0001
Full 239.064808
Reduced 249.988259
Goodness Of Fit Statistic ChiSquare DF Prob>ChiSq
Pearson 396.9196 396 0.4775
Deviance 478.1296 396 0.0029
Likelihood Ratio Test
• Variable Selection – Likelihood Ratio Test cont.
Effect Tests
Source DF L-R ChiSquare Prob>ChiSq
TOPNOTCH 1 2.2143635 0.1367
GPA 1 4.2909753 0.0383
GRE 1 5.4555484 0.0195
Parameter Estimates
Term Estimate Std Error L-R ChiSquare Prob>ChiSq Lower CL Upper CL
Intercept -4.382202 1.1352224 15.917859 <.0001 -6.657167 -2.197805
TOPNOTCH[0] -0.218612 0.1459266 2.2143635 0.1367 -0.503583 0.070142
GPA 0.6675556 0.3252593 4.2909753 0.0383 0.0356956 1.3133755
GRE 0.0024768 0.0010702 5.4555484 0.0195 0.0003962 0.0046006
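The reported log likelihoods also allow a nested comparison that the slides do not show: the GPA-only model against the model with GPA, GRE, and Top Notch. The sketch below is an illustration computed from the -LogLikelihood values in the two Whole Model Test tables. It first reproduces the 3-degree-of-freedom whole-model test, then tests the two added predictors jointly on 2 degrees of freedom. The closed-form tail probabilities cover only 1, 2, and 3 degrees of freedom; the helper is hypothetical and not a general chi-squared routine.

```python
import math

# -LogLikelihood values copied from the Whole Model Test tables
NLL_INTERCEPT_ONLY = 249.988259  # "Reduced" row
NLL_GPA_ONLY = 243.48381         # "Full" row of the GPA-only model
NLL_ALL_THREE = 239.064808       # "Full" row of the GPA + GRE + Top Notch model


def chi2_upper_tail(x, df):
    """P(X > x) for chi-squared with df = 1, 2, or 3 (closed forms only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("this sketch only implements df = 1, 2, 3")


# Whole-model test: intercept-only model versus all three predictors (3 df)
whole_model_stat = 2.0 * (NLL_INTERCEPT_ONLY - NLL_ALL_THREE)
whole_model_p = chi2_upper_tail(whole_model_stat, df=3)

# Nested comparison: GPA-only versus all three predictors (adds GRE and Top Notch, 2 df)
nested_stat = 2.0 * (NLL_GPA_ONLY - NLL_ALL_THREE)
nested_p = chi2_upper_tail(nested_stat, df=2)

print(f"whole model: chi-square = {whole_model_stat:.4f}, p = {whole_model_p:.6f}")
print(f"GRE + Top Notch added to GPA: chi-square = {nested_stat:.4f}, p = {nested_p:.4f}")
```

The single-term Effect Tests in the table are the same kind of comparison, each dropping one predictor from the three-predictor model.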
Likelihood Ratio Test • Questions/Comments