Logistic Regression for binary outcomes

In Linear Regression, Y is continuous

In Logistic, Y is binary (0,1). Average Y is P.

Can’t use linear regression since:

1. Y can’t be linearly related to Xs.

2. Y does NOT have a Gaussian (normal) distribution around “mean” P.

We need a “linearizing” transformation and a non Gaussian error model

Since 0 <= P <= 1, we might use odds = P/(1-P).

Odds has no “ceiling” but has a “floor” of zero.

So we use the logit transformation: ln(P/(1-P)) = ln(odds) = logit(P)

Logit does not have a floor or ceiling.

Model: logit = ln(P/(1-P)) = β0 + β1X1 + β2X2 + … + βkXk

or odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit

Since P = odds/(1 + odds) and odds = e^logit,

P = e^logit/(1 + e^logit) = 1/(1 + e^(-logit))

[Figure: P vs logit. Horizontal axis: logit = log odds, from -4 to 4. Vertical axis: P = risk, from 0 to 1.]
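The mapping between logit, odds, and P can be tabulated directly. The sketch below assumes only the formulas on this slide; the function names `logit` and `inv_logit` are illustrative, not from the original.

```python
import math

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Probability P that corresponds to a logit (log odds) value z."""
    return 1.0 / (1.0 + math.exp(-z))

# Tabulate the S-shaped curve over the same range as the plot (-4 to 4).
for z in range(-4, 5):
    odds = math.exp(z)
    print(f"logit={z:+d}  odds={odds:8.3f}  P={inv_logit(z):.3f}")
```

A logit of 0 corresponds to odds of 1 and P = 0.5; P approaches 0 or 1 only as the logit goes to minus or plus infinity.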

If ln(odds)= β0+ β1X1 + β2X2+…+βkXk

then

odds = (e^β0) (e^(β1X1)) (e^(β2X2)) … (e^(βkXk))

or

odds = (base odds) OR1 OR2 … ORk

Model is multiplicative on the odds scale

(Base odds are odds when all Xs=0)

ORi = odds ratio for the ith X
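A small numeric check of the multiplicative form may help. The coefficients below are made up for illustration (they do not come from any model in these notes), and both predictors are assumed to be binary and equal to 1.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
b0, b1, b2 = -2.0, 0.7, 0.4
x1, x2 = 1, 1  # both binary predictors present

base_odds = math.exp(b0)   # odds when all Xs = 0
or1 = math.exp(b1 * x1)    # odds ratio contributed by X1
or2 = math.exp(b2 * x2)    # odds ratio contributed by X2

odds_product = base_odds * or1 * or2
odds_direct = math.exp(b0 + b1 * x1 + b2 * x2)

# The two values agree: adding on the logit scale is multiplying on the odds scale.
print(odds_product, odds_direct)
```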

Interpreting β coefficients

Example: Dichotomous X

X = 0 for males, X=1 for females

logit(P) = β0 + β1 X

M: X=0, logit(Pm)= β0

F: X=1, logit(Pf) = β0 + β1

logit(Pf) – logit(Pm) = β1

log(OR) = β1, so OR = e^β1

Example: P is proportion with disease

logit(P) = β0 + β1 age + β2 sex

“sex” is coded 0 for M, 1 for F

The OR for F vs M for disease is e^β2 if both are the same age.

e^β1 is the odds ratio for disease for a one year increase in age, that is, the factor by which the odds are multiplied.

(e^β1)^k = e^(kβ1) is the OR for a k year difference in age between two groups of the same sex.

Example: P is proportion with a MI

Predictors: age in years

htn = hypertension (1=yes, 0=no)

smoke = smoking (1=yes, 0=no)

Logit(P) = β0+ β1age + β2 htn + β3 smoke

Q: Want OR for a 40 year old with hypertension vs otherwise identical 30 year old without hypertension.

A: (β0 + 40β1 + β2 + β3 smoke) - (β0 + 30β1 + β3 smoke)

= 10β1 + β2 = log OR. OR = e^(10β1 + β2).
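The cancellation of β0 and the smoking term can be checked numerically. This is a sketch only: the coefficient values are hypothetical, since the notes give no fitted values for the MI model, and `logit_mi` and `odds_ratio` are illustrative names.

```python
import math

# Hypothetical coefficients for illustration only (not from a fitted model).
b0, b_age, b_htn, b_smoke = -6.0, 0.05, 0.6, 0.8

def logit_mi(age, htn, smoke):
    """Log odds of MI under the model logit(P) = b0 + b1 age + b2 htn + b3 smoke."""
    return b0 + b_age * age + b_htn * htn + b_smoke * smoke

def odds_ratio(smoke):
    """OR for a 40 year old with htn vs a 30 year old without, same smoking status."""
    diff = logit_mi(40, 1, smoke) - logit_mi(30, 0, smoke)
    return math.exp(diff)

# Same OR for smokers and non-smokers, equal to exp(10*b1 + b2).
print(odds_ratio(0), odds_ratio(1), math.exp(10 * b_age + b_htn))
```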

Interactions

P is proportion with CHD

S: 1 = smoking, 0 = non. D: 1 = drinking, 0 = non

Logit(P) = β0 + β1 S + β2 D + β3 SD

Referent category is S=0, D=0

S  D  odds                OR
0  0  e^β0                OR00 = e^β0/e^β0 = 1
1  0  e^(β0+β1)           OR10 = e^β1
0  1  e^(β0+β2)           OR01 = e^β2
1  1  e^(β0+β1+β2+β3)     OR11 = e^(β1+β2+β3)

When will OR11=OR10 x OR01? IFF β3=0
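To see why β3 = 0 is the condition, note that OR11/(OR10 x OR01) = e^β3. The sketch below uses hypothetical values for β1, β2, and β3; `odds_ratios` is an illustrative name.

```python
import math

def odds_ratios(b1, b2, b3):
    """Return (OR10, OR01, OR11) relative to the referent S=0, D=0."""
    or10 = math.exp(b1)
    or01 = math.exp(b2)
    or11 = math.exp(b1 + b2 + b3)
    return or10, or01, or11

# Hypothetical main effects, with and without an interaction term.
for b3 in (0.0, 0.5):
    or10, or01, or11 = odds_ratios(0.6, 0.3, b3)
    ratio = or11 / (or10 * or01)   # equals exp(b3)
    print(f"b3={b3}: OR11={or11:.3f}  OR10*OR01={or10 * or01:.3f}  ratio={ratio:.3f}")
```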

Interpretation example

Potential predictors (13) of in-hospital infection mortality (yes or no)

Crabtree, et al. JAMA 8 Dec 1999 No 22, 2143-2148

Gender (female or male)

Age in years

APACHE score (0-129)

Diabetes (y/n)

Renal insufficiency / Hemodialysis (y/n)

Intubation / mechanical ventilation (y/n)

Malignancy (y/n)

Steroid therapy (y/n)

Transfusions (y/n)

Organ transplant (y/n)

WBC - count

Max temperature - degrees

Days from admission to treatment (> 7 days)

Factors Associated With Mortality for All Infections

Characteristic        Odds Ratio (95% CI)   p value
Incr APACHE* score    1.15 (1.11-1.18)      <.001
Transfusion (y/n)     4.15 (2.46-6.99)      <.001
Increasing age        1.03 (1.02-1.05)      <.001
Malignancy            2.60 (1.62-4.17)      <.001
Max Temperature       0.70 (0.58-0.85)      <.001
Adm to treat>7 d      1.66 (1.05-2.61)      0.03
Female (y/n)          1.32 (0.90-1.94)      0.16

*APACHE = Acute Physiology & Chronic Health Evaluation Score

Diabetes complications - Descriptive stats

Table of obese by diabetes complication

             diabetes complication
obese      no (0)   yes (1)   Total   % yes
no (0)       56       28        84    28/84 = 33%
yes (1)      20       41        61    41/61 = 67%
Total        76       69       145
% obese     26%      59%

RR = 2.0, OR = 4.1, p < 0.001

Fasting glucose (“fast glu”), mg/dl

                   n     min    median    mean     max
No complication   76    70.0     90.0     91.2   112.0
Complication      69    75.0    114.0    155.9   353.0
p=

Steady state glucose (“steady glu”), mg/dl

                   n     min    median    mean     max
No complication   76    29.0    105.0    114.0   273.0
Complication      69    60.0    257.0    261.5   480.0
p=
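The RR and OR quoted for the obesity table can be reproduced from the four cell counts. A minimal sketch, using the counts from the table above; the variable names are illustrative.

```python
# Cell counts from the obese by diabetes complication table.
a, b = 41, 20   # obese: complication yes, complication no
c, d = 28, 56   # not obese: complication yes, complication no

risk_obese = a / (a + b)        # 41/61
risk_not_obese = c / (c + d)    # 28/84

rr = risk_obese / risk_not_obese   # relative risk
or_ = (a * d) / (b * c)            # odds ratio (cross-product ratio)

print(f"risk obese={risk_obese:.2f}  risk not obese={risk_not_obese:.2f}")
print(f"RR={rr:.1f}  OR={or_:.1f}")
```

The OR is further from 1 than the RR here because the complication is common in both groups.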

Diabetes complication

Parameter    DF   beta     SE(b)   Chi-Square   p
Intercept    1    -14.70   3.231   20.706       <.0001
obese        1    0.328    0.615   0.285        0.5938
Fast glu     1    0.108    0.031   12.456       0.0004
Steady glu   1    0.023    0.005   18.322       <.0001

Log odds diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

Statistical sig of the βs

Linear regr: t = b/SE -> p value

Logistic regr: χ2 = (b/SE)^2 -> p value

To get a CI for the OR, first form the (95%) CI for β on the log scale

b - 1.96 SE, b + 1.96 SE

Then take antilogs of each end

e^(b - 1.96 SE), e^(b + 1.96 SE)
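As a worked check, the Wald chi-square and the 95% limits for the “obese” coefficient (b = 0.328, SE = 0.615 in the diabetes complication model) can be computed as follows. The function names are illustrative; small differences from printed values are expected because b and SE are rounded.

```python
import math

def wald_chi_square(b, se):
    """Wald chi-square statistic (1 df) for a logistic coefficient."""
    return (b / se) ** 2

def wald_or_ci(b, se, z=1.96):
    """Return (OR, lower, upper): CI formed on the log scale, then antilogged."""
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return math.exp(b), lower, upper

# "obese" coefficient from the diabetes complication model.
chi2 = wald_chi_square(0.328, 0.615)
or_, lo, hi = wald_or_ci(0.328, 0.615)
print(f"chi-square={chi2:.3f}  OR={or_:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The interval is symmetric around b on the log scale but not around the OR itself.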

Diabetes complications: Odds Ratio Estimates

Effect       Point Estimate      95% Wald Confidence Limits
obese        e^0.328 = 1.388     0.416    4.631
Fast glu     e^0.108 = 1.114     1.049    1.182
Steady glu   e^0.023 = 1.023     1.012    1.033

Model fit - Linear vs Logistic regression

k variables, n observations

Variation   df       sum of squares or deviance
Model       k        G
Error       n-k-1    D
Total       n-1      T  <- fixed

Yi = ith observation, Ŷi = prediction for ith obs

statistic           Linear regr                  Logistic regr
D/(n-k-1)           Residual variance (SDe^2)    Mean deviance
Σ[(Yi-Ŷi)/Ŷ]^2      --                           Hosmer-L χ2
Corr(Y,Ŷ)^2         R2                           Cox-Snell R2
G/T                 R2                           Pseudo R2

Good regression models have large G and small D. For logistic regression, D/(n-k-1), the mean deviance, should be near 1.0.

There are two versions of the R2 for logistic regression.

Goodness of fit: Deviance

Deviance in logistic is like SS in linear regr

              df     -2 log L    p value
Model (G)      3     117.21      < 0.001
Error (D)    141      83.46
Total (T)    144     200.67

mean deviance = 83.46/141 = 0.59

(want mean deviance to be ≤ 1)

R2 pseudo = G/total = 117/201 = 0.58, R2 CS (Cox-Snell) = 0.554
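These fit statistics follow from the deviance table with a few lines of arithmetic. The sketch assumes the standard Cox-Snell definition, R2 CS = 1 - exp(-G/n), which the notes do not spell out; n = 145 is taken from the descriptive table.

```python
import math

n = 145                            # number of observations
G, D, T = 117.21, 83.46, 200.67    # model, error, and total deviance (-2 log L)
df_error = 141

mean_deviance = D / df_error
r2_pseudo = G / T
# Assumed definition: Cox-Snell R2 = 1 - exp(-G/n).
r2_cox_snell = 1.0 - math.exp(-G / n)

print(f"mean deviance={mean_deviance:.2f}")
print(f"pseudo R2={r2_pseudo:.2f}  Cox-Snell R2={r2_cox_snell:.3f}")
```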

Goodness of fit: H-L chi sq

Compare observed vs model predicted (expected) frequencies by pred. decile

decile   total   obs y   exp y   obs no   exp no
1        16      0       0.23    16       15.8
2        15      0       0.61    15       14.4
3        15      0       1.31    15       13.7
…
8        16      15      15.6    1        0.40
9        23      23      23.0    0        0.00

chi-square = 9.89, df = 7, p = 0.1946

Goodness of fit vs R2

Interpretation when goodness of fit is acceptable and R2 is poor.

Need to include interactions or make transformation on X variables in model?

Need to obtain more X variables?

Sensitivity & Specificity

Sensitivity=a/(a+c), false neg=c/(a+c)

Specificity=d/(b+d), false pos=b/(b+d)

Accuracy = W sensitivity + (1-W) specificity

True pos True neg

Classify pos a b

Classify neg c d

total a+c b+d

Any good classification rule, including a logistic model, should have high sensitivity & specificity. In logistic, we choose a cutpoint, Pc,

Predict positive if P > Pc

Predict negative if P < Pc

Diabetes complication

logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

Pi = 1/(1+ exp(-logit))

Compute Pi for all observations, find value of Pi (call it P0) that maximizes

accuracy=0.5 sensitivity + 0.5 specificity

This is an ROC analysis using the logit (or Pi)
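The cutpoint search described here can be sketched as a loop over candidate cutpoints. The predicted probabilities and outcomes below are made up for illustration (the diabetes data are not reproduced in these notes), and `sens_spec` and `best_cutpoint` are illustrative names.

```python
def sens_spec(probs, labels, cut):
    """Sensitivity and specificity when predicting positive if P > cut."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > cut)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p <= cut)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= cut)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p > cut)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutpoint(probs, labels, w=0.5):
    """Try each observed P as a cutpoint; keep the first one with maximal accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(probs)):
        sens, spec = sens_spec(probs, labels, cut)
        acc = w * sens + (1 - w) * spec
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc

# Made-up predicted probabilities and true outcomes, for illustration only.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
cut, acc = best_cutpoint(probs, labels)
print(cut, acc)
```

Changing `w` shifts the balance between sensitivity and specificity, which moves the chosen cutpoint along the ROC curve.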

ROC for logistic model

Diabetes model accuracy

True comp True no comp

Pred yes 55 11

Pred no 14 65

total 69 76

Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%

Accuracy = (79.7% + 85.5%)/2 = 82.6%

Logit = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61

C statistic (report this)

n0=num negative, n1=num positive

Make all n0 x n1 pairs (1,0)

Concordant if

predicted P for Y=1 > predicted P for Y=0

Discordant if

predicted P for Y=1 < predicted P for Y=0

C = (num concordant + 0.5 × num ties) / (n0 × n1)

C=0.949 for diabetes complication model
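The pairwise definition translates directly into a double loop over positives and negatives. A minimal sketch on a tiny made-up data set; `c_statistic` is an illustrative name, and for large n a rank-based calculation would be faster.

```python
def c_statistic(probs, labels):
    """C = (concordant + 0.5 * ties) / (n0 * n1) over all (Y=1, Y=0) pairs."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = 0
    ties = 0
    for p1 in pos:
        for p0 in neg:
            if p1 > p0:
                concordant += 1
            elif p1 == p0:
                ties += 1
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))

# Made-up predicted probabilities and outcomes, for illustration only.
probs = [0.1, 0.3, 0.3, 0.6, 0.8]
labels = [0, 0, 1, 1, 0]
print(c_statistic(probs, labels))
```

C = 0.5 means the model ranks positives above negatives no better than chance; C = 1 means every positive has a higher predicted P than every negative.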

Logistic model is also a discriminant model (LDA)

[Figure: Histograms of logit scores for each group. Horizontal axis: logit(P), from -4.0 to 4.0. Vertical axis: freq, from 0.00 to 0.60.]

Poisson Regression

Y is a small non-negative integer (a count): 0, 1, 2, …

Model:

ln(mean Y) = β0+ β1X1 + β2X2+…+βkXk

so

mean Y = exp(β0+ β1X1 + β2X2+…+βkXk)

d(mean Y)/dXi = βi × mean Y, so βi = [d(mean Y)/dXi]/mean Y

100 βi is approximately the percent change in mean Y per unit change in Xi (for small βi; the exact value is 100(e^βi - 1))
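Because mean Y = exp(β0 + … + βiXi + …), a one unit increase in Xi multiplies mean Y by e^βi, so the exact percent change is 100(e^βi - 1) and 100 βi is its small-β approximation. The sketch below compares the two for a few hypothetical β values.

```python
import math

def pct_change_exact(beta):
    """Exact percent change in mean Y for a one unit increase in X."""
    return 100.0 * (math.exp(beta) - 1.0)

def pct_change_approx(beta):
    """Small-beta approximation: 100 * beta."""
    return 100.0 * beta

# Hypothetical coefficient values, for illustration only.
for beta in (0.01, 0.05, 0.20, 0.50):
    print(f"beta={beta:.2f}  approx={pct_change_approx(beta):6.2f}%  "
          f"exact={pct_change_exact(beta):6.2f}%")
```

The two agree closely for small coefficients and diverge as β grows.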

End

Equation for logit = log odds = depr “score”

logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income

odds depr = e^logit, risk = odds/(1+odds)

coding:

Female: 0 for M, 1 for F

Chron ill: 0 for no, 1 for yes

Income in 1000s

Example: Depression (y/n)

Model for depression

term        coeff=β    SE       p value
Intercept   -1.8259    0.4495   0.0001
female      0.8332     0.3882   0.0319
chron ill   0.3578     0.3300   0.2782
income      -0.0299    0.0135   0.0268

Female, chron ill are binary, income in 1000s

ORs

term        coeff=β    OR = e^β
Intercept   -1.8259    ---
female      0.8332     2.301
chron ill   0.3578     1.430
income      -0.0299    0.971
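The depression model can be turned into a small risk calculator using the coefficients in the table. This is a sketch: the function name, dictionary keys, and the example person (income of 20, in the model's units of 1000s) are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients from the depression model table.
COEF = {"intercept": -1.8259, "female": 0.8332, "chron_ill": 0.3578, "income": -0.0299}

def depression_risk(female, chron_ill, income_thousands):
    """Risk of depression: logit -> odds -> risk = odds/(1 + odds)."""
    logit = (COEF["intercept"]
             + COEF["female"] * female
             + COEF["chron_ill"] * chron_ill
             + COEF["income"] * income_thousands)
    odds = math.exp(logit)
    return odds / (1.0 + odds)

# Odds ratios are the antilogs of the non-intercept coefficients.
odds_ratios = {k: math.exp(v) for k, v in COEF.items() if k != "intercept"}
print(odds_ratios)

# Example: female vs male, no chronic illness, income of 20 (in 1000s).
print(depression_risk(1, 0, 20), depression_risk(0, 0, 20))
```

The female OR of about 2.3 applies to the odds, so the ratio of the two printed risks is smaller than 2.3.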
