Logistic Regression for binary outcomes
In Linear Regression, Y is continuous
In Logistic, Y is binary (0,1). Average Y is P.
Can’t use linear regression since:
1. Y can’t be linearly related to Xs.
2. Y does NOT have a Gaussian (normal)
distribution around “mean” P.
We need a “linearizing” transformation and a non-Gaussian error model
Since 0 <= P <= 1, we might use odds = P/(1-P). Odds has no “ceiling” but has a “floor” of zero.
So we use the logit transformation:
ln(P/(1-P)) = ln(odds) = logit(P)
Logit does not have a floor or ceiling.
Model: logit = ln(P/(1-P)) = β0 + β1X1 + β2X2 + … + βkXk
or odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit
Since P = odds/(1 + odds) and odds = e^logit,
P = e^logit/(1 + e^logit) = 1/(1 + e^(-logit))
[Figure: P vs logit. x-axis: logit = log odds, from -4 to 4; y-axis: P = risk, from 0 to 1.]
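The logit and its inverse can be sketched in a few lines of Python. This is only an illustration of the formulas above; the function names `logit` and `inv_logit` are chosen here and do not come from the slides.

```python
import math

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Probability that corresponds to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))

# Tabulate the S-shaped curve over the range shown in the plot.
for z in (-4, -2, 0, 2, 4):
    print(f"logit={z:+d}  odds={math.exp(z):8.3f}  P={inv_logit(z):.3f}")
```

A logit of 0 corresponds to odds of 1 and P = 0.5; large negative or positive logits push P toward 0 or 1 without ever reaching them.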
If ln(odds)= β0+ β1X1 + β2X2+…+βkXk
then
odds = (e^β0) (e^(β1X1)) (e^(β2X2)) … (e^(βkXk))
or
odds = (base odds) OR1 OR2 … ORk
Model is multiplicative on the odds scale
(Base odds are odds when all Xs=0)
ORi = odds ratio for the ith X
Interpreting β coefficients
Example: Dichotomous X
X = 0 for males, X=1 for females
logit(P) = β0 + β1 X
M: X=0, logit(Pm)= β0
F: X=1, logit(Pf) = β0 + β1
logit(Pf) – logit(Pm) = β1
log(OR) = β1, e^β1 = OR
Example: P is proportion with disease
logit(P) = β0 + β1 age + β2 sex
“sex” is coded 0 for M, 1 for F
OR for F vs M for disease is e^β2 if both are the same age.
e^β1 is the odds ratio for disease for a one-year increase in age (the odds are multiplied by e^β1).
(e^β1)^k = e^(kβ1) is the OR for a k-year difference in age between two groups of the same sex.
Example: P is proportion with a MI
Predictors: age in years
htn = hypertension (1=yes, 0=no)
smoke = smoking (1=yes, 0=no)
Logit(P) = β0+ β1age + β2 htn + β3 smoke
Q: Want OR for a 40 year old with hypertension vs otherwise identical 30 year old without hypertension.
A: [β0 + 40β1 + β2 + β3 smoke] - [β0 + 30β1 + β3 smoke]
= 10β1 + β2 = log OR. OR = e^(10β1 + β2).
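The subtraction above can be checked numerically. The slides give no fitted values for this MI model, so the coefficients below are hypothetical and chosen only for illustration; the point of the sketch is that β0 and the smoking term cancel.

```python
import math

# Hypothetical coefficients (NOT from the slides), for illustration only.
b0, b_age, b_htn, b_smoke = -6.0, 0.05, 0.70, 0.60

def log_odds(age, htn, smoke):
    """Logit(P) = b0 + b_age*age + b_htn*htn + b_smoke*smoke."""
    return b0 + b_age * age + b_htn * htn + b_smoke * smoke

smoke = 1  # any value works, as long as it is the same for both people
log_or = log_odds(40, 1, smoke) - log_odds(30, 0, smoke)
odds_ratio = math.exp(log_or)

print(f"log OR = {log_or:.3f}, OR = {odds_ratio:.3f}")
```

With these made-up values the log OR equals 10 × 0.05 + 0.70 = 1.2 regardless of the smoking status used, as long as both people share it.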
Interactions
P is proportion with CHD
S: 1 = smoking, 0 = non. D: 1 = drinking, 0 = non
Logit(P) = β0 + β1 S + β2 D + β3 SD
Referent category is S=0, D=0
S  D  odds               OR
0  0  e^β0               OR00 = e^β0/e^β0 = 1
1  0  e^(β0+β1)          OR10 = e^β1
0  1  e^(β0+β2)          OR01 = e^β2
1  1  e^(β0+β1+β2+β3)    OR11 = e^(β1+β2+β3)
When will OR11 = OR10 x OR01? If and only if β3 = 0.
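A small sketch of the interaction condition follows. The coefficient values are hypothetical (the slides give none for this CHD model); the code only shows that OR11 departs from OR10 x OR01 by the factor e^β3.

```python
import math

def odds_ratios(b1, b2, b3):
    """Return (OR10, OR01, OR11) relative to the referent S=0, D=0."""
    or10 = math.exp(b1)            # smoking only
    or01 = math.exp(b2)            # drinking only
    or11 = math.exp(b1 + b2 + b3)  # both, including the interaction term
    return or10, or01, or11

# Hypothetical values, for illustration only.
for b3 in (0.0, 0.4):
    or10, or01, or11 = odds_ratios(0.5, 0.3, b3)
    ratio = or11 / (or10 * or01)
    print(f"b3={b3:.1f}  OR10*OR01={or10 * or01:.3f}  OR11={or11:.3f}  ratio={ratio:.3f}")
```

When β3 = 0 the ratio is 1 and the two effects simply multiply; otherwise the ratio equals e^β3.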
Interpretation example
Potential predictors (13) of in-hospital infection mortality (yes or no)
Crabtree, et al. JAMA, 8 Dec 1999, No 22, 2143-2148
Gender (female or male)
Age in years
APACHE score (0-129)
Diabetes (y/n)
Renal insufficiency / Hemodialysis (y/n)
Intubation / mechanical ventilation (y/n)
Malignancy (y/n)
Steroid therapy (y/n)
Transfusions (y/n)
Organ transplant (y/n)
WBC - count
Max temperature - degrees
Days from admission to treatment (> 7 days)
Factors Associated With Mortality for All Infections
Characteristic        Odds Ratio (95% CI)    p value
Incr APACHE* score    1.15 (1.11-1.18)       <.001
Transfusion (y/n)     4.15 (2.46-6.99)       <.001
Increasing age        1.03 (1.02-1.05)       <.001
Malignancy            2.60 (1.62-4.17)       <.001
Max Temperature       0.70 (0.58-0.85)       <.001
Adm to treat > 7 d    1.66 (1.05-2.61)       0.03
Female (y/n)          1.32 (0.90-1.94)       0.16
*APACHE = Acute Physiology & Chronic Health Evaluation Score
Diabetes complications - Descriptive stats
Table of obese by diabetes complication (Freq)
         | diabetes complication |
obese    |  no (0)  |  yes (1)   | Total | % yes
---------+----------+------------+-------+-------------
no  (0)  |    56    |     28     |   84  | 28/84 = 33%
---------+----------+------------+-------+-------------
yes (1)  |    20    |     41     |   61  | 41/61 = 67%
---------+----------+------------+-------+-------------
Total    |    76    |     69     |  145  |
% obese  |    26%   |     59%    |       |
RR = 2.0, OR = 4.1, p < 0.001
Fasting glucose (“fast glu”), mg/dl
                   n    min    median   mean    max
No complication    76   70.0    90.0     91.2   112.0
Complication       69   75.0   114.0    155.9   353.0
p =
Steady state glucose (“steady glu”), mg/dl
                   n    min    median   mean    max
No complication    76   29.0   105.0    114.0   273.0
Complication       69   60.0   257.0    261.5   480.0
p =
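The RR and OR quoted for the obesity table can be reproduced from the four cell counts. The sketch below uses only the frequencies shown above; the variable names are chosen here for illustration.

```python
# Cell counts from the obese-by-complication table.
obese_comp, obese_nocomp = 41, 20        # obese: with / without complication
notobese_comp, notobese_nocomp = 28, 56  # not obese: with / without complication

risk_obese = obese_comp / (obese_comp + obese_nocomp)
risk_notobese = notobese_comp / (notobese_comp + notobese_nocomp)

# Relative risk compares the two risks; the odds ratio compares the two odds.
relative_risk = risk_obese / risk_notobese
odds_ratio = (obese_comp * notobese_nocomp) / (obese_nocomp * notobese_comp)

print(f"risk if obese     = {risk_obese:.2f}")
print(f"risk if not obese = {risk_notobese:.2f}")
print(f"RR = {relative_risk:.1f}, OR = {odds_ratio:.1f}")
```

Because the complication is common in this sample, the OR (4.1) is noticeably larger than the RR (2.0).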
Diabetes complication
Parameter    DF   beta      SE(b)   Chi-Square   p
Intercept    1    -14.70    3.231   20.706       <.0001
obese        1      0.328   0.615    0.285       0.5938
Fast glu     1      0.108   0.031   12.456       0.0004
Steady glu   1      0.023   0.005   18.322       <.0001
Log odds diabetes complication =
-14.7+0.328 obese+0.108 fast glu + 0.023 steady glu
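As a rough illustration of how the fitted equation is used, the sketch below converts the log odds to a predicted probability. The coefficients are the ones reported above; the patient values (obese, fasting glucose 100, steady state glucose 200) are hypothetical.

```python
import math

def complication_logit(obese, fast_glu, steady_glu):
    """Log odds of a diabetes complication from the fitted model."""
    return -14.7 + 0.328 * obese + 0.108 * fast_glu + 0.023 * steady_glu

def complication_prob(obese, fast_glu, steady_glu):
    """Predicted probability P = 1 / (1 + exp(-logit))."""
    z = complication_logit(obese, fast_glu, steady_glu)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical patient, for illustration only.
z = complication_logit(obese=1, fast_glu=100, steady_glu=200)
p = complication_prob(obese=1, fast_glu=100, steady_glu=200)
print(f"logit = {z:.3f}, P = {p:.3f}")
```

For this made-up patient the logit is about 1.03, which corresponds to a predicted probability of about 0.74.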
Statistical sig of the βs
Linear regr: t = b/SE -> p value
Logistic regr: χ^2 = (b/SE)^2 -> p value
To get a CI for the OR, must first form the (95%) CI for β on the log scale:
b - 1.96 SE, b + 1.96 SE
Then take antilogs of each end:
e^(b - 1.96 SE), e^(b + 1.96 SE)
Diabetes complications: Odds Ratio Estimates
Effect       Point Estimate      95% Wald Confidence Limits
obese        e^0.328 = 1.388     0.416    4.631
Fast glu     e^0.108 = 1.114     1.049    1.182
Steady glu   e^0.023 = 1.023     1.012    1.033
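The Wald chi-square and the confidence limits can be recomputed from each b and SE. This is a sketch using the rounded coefficients printed in the slides, so the results agree with the table only to about two decimal places; the function name `wald_summary` is chosen here.

```python
import math

def wald_summary(b, se, z=1.96):
    """Return (chi-square, OR, lower limit, upper limit) for one coefficient."""
    chi_sq = (b / se) ** 2
    return chi_sq, math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# (beta, SE) pairs as reported for the diabetes complication model.
coefficients = {
    "obese": (0.328, 0.615),
    "fast glu": (0.108, 0.031),
    "steady glu": (0.023, 0.005),
}

for name, (b, se) in coefficients.items():
    chi_sq, odds_ratio, lower, upper = wald_summary(b, se)
    print(f"{name:10s} chi2={chi_sq:7.3f}  OR={odds_ratio:.3f}  "
          f"95% CI=({lower:.3f}, {upper:.3f})")
```

An interval that includes 1.0, as for obese, corresponds to a coefficient that is not statistically significant at the 5% level.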
Model fit: Linear vs Logistic regression
k variables, n observations
Variation   df       sum of squares or deviance
Model       k        G
Error       n-k-1    D
Total       n-1      T  <- fixed
Yi = ith observation, Ŷi = prediction for ith obs
statistic          Linear regr       Logistic regr
D/(n-k-1)          Residual SDe^2    Mean deviance
Σ[(Yi-Ŷi)/Ŷi]^2    --                Hosmer-L χ^2
Corr(Y,Ŷ)^2        R^2               Cox-Snell R^2
G/T                R^2               Pseudo R^2
Good regression models have large G and small D. For logistic regression, D/(n-k-1), the mean deviance, should be near 1.0.
There are two versions of the R^2 for logistic regression.
Goodness of fit: Deviance
Deviance in logistic is like SS in linear regr
             df     -2 log L    p value
Model (G)      3    117.21      < 0.001
Error (D)    141     83.46
Total (T)    144    200.67
mean deviance = 83.46/141 = 0.59
(want mean deviance to be ≤ 1)
R^2 pseudo = G/total = 117/201 = 0.58, R^2 cs = 0.554
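The fit statistics above follow directly from the deviance table. The sketch below assumes n = 145 (total df plus one) and uses the usual Cox-Snell definition, 1 - exp(-G/n), which is not spelled out in the slides but reproduces the reported 0.554.

```python
import math

n = 145               # observations (total df 144 + 1)
G, D = 117.21, 83.46  # model and error deviance (-2 log L)
T = G + D             # total deviance
df_error = 141

mean_deviance = D / df_error
pseudo_r2 = G / T
cox_snell_r2 = 1.0 - math.exp(-G / n)  # assumed standard Cox-Snell formula

print(f"total deviance = {T:.2f}")
print(f"mean deviance  = {mean_deviance:.2f}")
print(f"pseudo R^2     = {pseudo_r2:.2f}")
print(f"Cox-Snell R^2  = {cox_snell_r2:.3f}")
```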
Goodness of fit: H-L chi sq
Compare observed vs model predicted (expected) frequencies by pred. decile
decile   total   obs y   exp y   obs no   exp no
1        16       0       0.23   16       15.8
2        15       0       0.61   15       14.4
3        15       0       1.31   15       13.7
…
8        16      15      15.6     1        0.40
9        23      23      23.0     0        0.00
chi-square = 9.89, df = 7, p = 0.1946
Goodness of fit vs R2
Interpretation when goodness of fit is acceptable and R2 is poor.
Need to include interactions or make transformation on X variables in model?
Need to obtain more X variables?
Sensitivity & Specificity
Sensitivity=a/(a+c), false neg=c/(a+c)
Specificity=d/(b+d), false pos=b/(b+d)
Accuracy = W sensitivity + (1-W) specificity
               True pos   True neg
Classify pos   a          b
Classify neg   c          d
total          a+c        b+d
Any good classification rule, including a logistic model, should have high sensitivity & specificity. In logistic, we choose a cutpoint, Pc,
Predict positive if P > Pc
Predict negative if P ≤ Pc
Diabetes complication logit(Pi) = -14.7+0.328 obese+0.108 fast glu
+0.023 steady glu
Pi = 1/(1+ exp(-logit))
Compute Pi for all observations, find value of Pi (call it P0) that maximizes
accuracy=0.5 sensitivity + 0.5 specificity
This is an ROC analysis using the logit (or Pi)
[Figure: ROC for logistic model]
Diabetes model accuracy
           True comp   True no comp
Pred yes   55          11
Pred no    14          65
total      69          76
Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
Accuracy = (79.7% + 85.5%)/2 = 82.6%
Logit = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61
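Sensitivity, specificity, and the weighted accuracy can be computed directly from the four counts in the classification table. The sketch below uses those counts with W = 0.5 and also converts the reported logit cutpoint to a probability; the variable names are chosen here.

```python
import math

# Counts from the diabetes classification table.
true_pos, false_pos = 55, 11   # predicted yes
false_neg, true_neg = 14, 65   # predicted no

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)

w = 0.5  # equal weight on sensitivity and specificity
accuracy = w * sensitivity + (1 - w) * specificity

# Probability cutpoint that corresponds to the logit cutpoint 0.447.
p0 = 1.0 / (1.0 + math.exp(-0.447))

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
print(f"P0          = {p0:.2f}")
```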
C statistic (report this)
n0=num negative, n1=num positive
Make all n0 x n1 pairs (1,0)
Concordant if
predicted P for Y=1 > predicted P for Y=0
Discordant if
predicted P for Y=1 < predicted P for Y=0
C = (num concordant + 0.5 x num ties) / (n0 x n1)
C=0.949 for diabetes complication model
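The pairwise definition of the C statistic can be written as a short brute-force loop. The outcomes and predicted probabilities below are a tiny hypothetical data set, not the diabetes data, and `c_statistic` is a name chosen here.

```python
def c_statistic(y, p):
    """C = (concordant + 0.5 * ties) / (n0 * n1) over all (Y=1, Y=0) pairs."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0
    ties = 0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                concordant += 1
            elif p_pos == p_neg:
                ties += 1
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities, for illustration only.
y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.6, 0.4, 0.6, 0.3, 0.2, 0.1]
print(f"C = {c_statistic(y, p):.3f}")
```

Here 10 of the 12 pairs are concordant, 1 is tied, and 1 is discordant, so C = 10.5/12. The double loop costs n0 x n1 comparisons, which is fine for small samples; C is also the area under the ROC curve.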
Logistic model is also a discriminant model (LDA)
[Figure: Histograms of logit scores for each group. x-axis: logit(P), from -4.0 to 4.0; y-axis: freq, from 0.00 to 0.60.]
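Poisson Regression
Y is a small non-negative integer count: 0, 1, 2, …
Model:
ln(mean Y) = β0 + β1X1 + β2X2 + … + βkXk
so
mean Y = exp(β0 + β1X1 + β2X2 + … + βkXk)
d(mean Y)/dXi = βi x mean Y, so βi = [d(mean Y)/dXi]/mean Y
100 βi is approximately the percent change in mean Y per unit change in Xi (exactly 100(e^βi - 1)); the approximation is good when βi is small.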
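How well 100 βi approximates the percent change can be seen by comparing it with the exact value 100(e^βi - 1), which follows from mean Y being multiplied by e^βi per unit of Xi. The β values in this sketch are hypothetical.

```python
import math

def approx_pct_change(beta):
    """Percent change in mean Y per unit X, small-beta approximation."""
    return 100.0 * beta

def exact_pct_change(beta):
    """Exact percent change: mean Y is multiplied by exp(beta) per unit X."""
    return 100.0 * (math.exp(beta) - 1.0)

# Hypothetical coefficients, for illustration only.
for beta in (0.01, 0.05, 0.10, 0.50):
    print(f"beta={beta:.2f}  approx={approx_pct_change(beta):6.2f}%  "
          f"exact={exact_pct_change(beta):6.2f}%")
```

The two agree closely for coefficients near zero and drift apart as the coefficient grows.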
End
Equation for logit = log odds = depr “score”
logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
odds depr = e^logit, risk = odds/(1 + odds)
Coding:
Female: 0 for M, 1 for F
Chron ill: 0 for no, 1 for yes
Income in 1000s
Example: Depression (y/n)
Model for depression
term        coeff=β    SE       p value
Intercept   -1.8259    0.4495   0.0001
female       0.8332    0.3882   0.0319
chron ill    0.3578    0.3300   0.2782
income      -0.0299    0.0135   0.0268
Female, chron ill are binary, income in 1000s
ORs
term        coeff=β    OR = e^β
Intercept   -1.8259    ---
female       0.8332    2.301
chron ill    0.3578    1.430
income      -0.0299    0.971
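The depression model can be turned into a small risk calculator. The coefficients are the ones in the table; the example person (female, chronically ill, income of 30 thousand) is hypothetical, and the function name is chosen here.

```python
import math

# Fitted coefficients from the depression model.
COEF = {"intercept": -1.8259, "female": 0.8332,
        "chron_ill": 0.3578, "income": -0.0299}

def depression_risk(female, chron_ill, income_thousands):
    """Risk = odds / (1 + odds), with odds = exp(logit)."""
    z = (COEF["intercept"] + COEF["female"] * female
         + COEF["chron_ill"] * chron_ill
         + COEF["income"] * income_thousands)
    odds = math.exp(z)
    return odds / (1.0 + odds)

# Odds ratios are the antilogs of the coefficients.
for term in ("female", "chron_ill", "income"):
    print(f"OR for {term:9s} = {math.exp(COEF[term]):.3f}")

# Hypothetical person, for illustration only.
risk = depression_risk(female=1, chron_ill=1, income_thousands=30)
print(f"predicted risk = {risk:.3f}")
```

For this made-up person the logit is about -1.53, giving a predicted risk of roughly 0.18.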