Logistic Regression for binary outcomes

Logistic Regression for binary outcomes. In Linear Regression, Y is continuous In Logistic, Y is binary (0,1). Average Y is P. Can’t use linear regression


In linear regression, Y is continuous.

In logistic regression, Y is binary (0, 1). The average of Y is P.

Can’t use linear regression since:

1. Y can’t be linearly related to the Xs.

2. Y does NOT have a Gaussian (normal) distribution around the “mean” P.

We need a “linearizing” transformation and a non-Gaussian error model.


Since 0 <= P <= 1, we might use odds = P/(1-P). Odds has no “ceiling” but has a “floor” of zero.

So we use the logit transformation

ln(P/(1-P)) = ln(odds) = logit(P)

Logit does not have a floor or ceiling.

Model:

logit = ln(P/(1-P)) = β0 + β1X1 + β2X2 + … + βkXk

or

odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit


Since P = odds/(1 + odds) and odds = e^logit,

P = e^logit/(1 + e^logit) = 1/(1 + e^(-logit))

[Figure: P (risk) vs. logit (log odds), for logit from -4 to 4]
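The relationship between P and the logit can be sketched in a few lines of Python. The function names `logit` and `inv_logit` are illustrative choices, not taken from the slides.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Probability P = 1/(1 + e^(-logit)) for a given logit z."""
    return 1.0 / (1.0 + math.exp(-z))

# Tabulate P over the same logit range as the plot (-4 to 4).
for z in range(-4, 5):
    print(f"logit = {z:+d}  P = {inv_logit(z):.3f}")
```

A logit of 0 corresponds to P = 0.5, and P approaches 0 or 1 only as the logit goes to minus or plus infinity.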


If ln(odds)= β0+ β1X1 + β2X2+…+βkXk

then

odds = (e^β0) (e^(β1X1)) (e^(β2X2)) … (e^(βkXk))

or

odds = (base odds) OR1 OR2 … ORk

Model is multiplicative on the odds scale

(Base odds are odds when all Xs=0)

ORi = e^(βiXi) = odds ratio for the ith X (at value Xi versus Xi = 0)
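A small numerical check of the multiplicative form, as a sketch. The coefficients and X values below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical coefficients and predictor values, for illustration only.
beta0 = -2.0
betas = [0.5, -0.3, 0.1]
xs = [1.0, 2.0, 4.0]

# Odds computed directly from the logit.
logit_value = beta0 + sum(b * x for b, x in zip(betas, xs))
odds_from_logit = math.exp(logit_value)

# Odds computed as base odds times one factor per predictor.
base_odds = math.exp(beta0)
factors = [math.exp(b * x) for b, x in zip(betas, xs)]
odds_from_product = base_odds * math.prod(factors)

print(odds_from_logit, odds_from_product)
```

The two numbers agree up to floating-point rounding, which is what "multiplicative on the odds scale" means.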


Interpreting β coefficients

Example: Dichotomous X

X = 0 for males, X=1 for females

logit(P) = β0 + β1 X

M: X=0, logit(Pm)= β0

F: X=1, logit(Pf) = β0 + β1

logit(Pf) – logit(Pm) = β1

log(OR) = β1, e^β1 = OR


Example: P is proportion with disease

logit(P) = β0 + β1 age + β2 sex

“sex” is coded 0 for M, 1 for F.

The OR for F vs M for disease is e^β2 if both are the same age.

e^β1 is the factor by which the odds of disease are multiplied (the OR) for a one-year increase in age.

(e^β1)^k = e^(kβ1) is the OR for a k-year change in age in two groups of the same sex.


Example: P is proportion with a MI

Predictors: age in years

htn = hypertension (1=yes, 0=no)

smoke = smoking (1=yes, 0=no)

Logit(P) = β0+ β1age + β2 htn + β3 smoke

Q: Want OR for a 40 year old with hypertension vs otherwise identical 30 year old without hypertension.

A: (β0 + 40 β1 + β2 + β3 smoke) - (β0 + 30 β1 + β3 smoke)

= 10 β1 + β2 = log OR. OR = e^(10 β1 + β2).
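The same subtraction can be sketched in Python. The coefficient values are hypothetical (the slide gives none), and `odds_ratio` is an illustrative helper name.

```python
import math

def odds_ratio(betas, x_a, x_b):
    """OR comparing profile x_a to profile x_b; the intercept cancels."""
    log_or = sum(betas[name] * (x_a[name] - x_b[name]) for name in betas)
    return math.exp(log_or)

# Hypothetical coefficients (not from the slides).
betas = {"age": 0.05, "htn": 0.7, "smoke": 0.4}

# 40 year old with hypertension vs otherwise identical 30 year old without.
person_a = {"age": 40, "htn": 1, "smoke": 1}
person_b = {"age": 30, "htn": 0, "smoke": 1}

print(odds_ratio(betas, person_a, person_b))  # equals exp(10*0.05 + 0.7)
```

Because smoking status is the same in both profiles, β3 drops out of the difference along with β0.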


Interactions

P is proportion with CHD. S: 1 = smoking, 0 = non. D: 1 = drinking, 0 = non.

Logit(P) = β0 + β1 S + β2 D + β3 SD

Referent category is S=0, D=0

S   D   odds                 OR
0   0   e^β0                 OR00 = e^β0/e^β0 = 1
1   0   e^(β0+β1)            OR10 = e^β1
0   1   e^(β0+β2)            OR01 = e^β2
1   1   e^(β0+β1+β2+β3)      OR11 = e^(β1+β2+β3)

When will OR11 = OR10 x OR01? If and only if β3 = 0.
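A quick numerical illustration of the β3 = 0 condition, as a sketch. The coefficient values are hypothetical and `or_vs_referent` is an illustrative name.

```python
import math

def or_vs_referent(b1, b2, b3, s, d):
    """OR for (S=s, D=d) versus the referent category (S=0, D=0)."""
    return math.exp(b1 * s + b2 * d + b3 * s * d)

# Hypothetical main-effect coefficients, for illustration only.
b1, b2 = 0.6, 0.4

for b3 in (0.0, 0.5):
    or10 = or_vs_referent(b1, b2, b3, 1, 0)
    or01 = or_vs_referent(b1, b2, b3, 0, 1)
    or11 = or_vs_referent(b1, b2, b3, 1, 1)
    print(f"b3 = {b3}: OR11 = {or11:.3f}, OR10 x OR01 = {or10 * or01:.3f}")
```

With b3 = 0 the two printed values match; with a nonzero b3 they differ by the factor e^b3.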


Interpretation example

Potential predictors (13) of in-hospital infection mortality (yes or no)

Crabtree, et al. JAMA, 8 Dec 1999, No 22, 2143-2148

Gender (female or male)

Age in years

APACHE score (0-129)

Diabetes (y/n)

Renal insufficiency / Hemodialysis (y/n)

Intubation / mechanical ventilation (y/n)

Malignancy (y/n)

Steroid therapy (y/n)

Transfusions (y/n)

Organ transplant (y/n)

WBC - count

Max temperature - degrees

Days from admission to treatment (> 7 days)


Factors Associated With Mortality for All Infections

Characteristic       Odds Ratio (95% CI)   p value
Incr APACHE score    1.15 (1.11-1.18)      <.001
Transfusion (y/n)    4.15 (2.46-6.99)      <.001
Increasing age       1.03 (1.02-1.05)      <.001
Malignancy           2.60 (1.62-4.17)      <.001
Max Temperature      0.70 (0.58-0.85)      <.001
Adm to treat>7 d     1.66 (1.05-2.61)      0.03
Female (y/n)         1.32 (0.90-1.94)      0.16

*APACHE = Acute Physiology & Chronic Health Evaluation Score


Diabetes complications - Descriptive stats

Table of obese by diabetes complication

                diabetes complication
obese      no-0   yes-1   Total   % yes
no  0       56     28      84     28/84 = 33%
yes 1       20     41      61     41/61 = 67%
Total       76     69     145
% obese     26%    59%

RR = 2.0, OR = 4.1, p < 0.001

Fasting glucose (“fast glu”), mg/dl

                   n     min    median   mean    max
No complication    76    70.0    90.0     91.2   112.0
Complication       69    75.0   114.0    155.9   353.0

p=

Steady state glucose (“steady glu”), mg/dl

                   n     min    median   mean    max
No complication    76    29.0   105.0    114.0   273.0
Complication       69    60.0   257.0    261.5   480.0

p=
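The RR and OR reported for the obesity table can be reproduced from the four cell counts. This is a minimal sketch; the variable names are illustrative.

```python
# Counts from the obese-by-diabetes-complication table above.
obese_yes, obese_no = 41, 20          # obese: complication yes / no
nonobese_yes, nonobese_no = 28, 56    # not obese: complication yes / no

risk_obese = obese_yes / (obese_yes + obese_no)               # 41/61
risk_nonobese = nonobese_yes / (nonobese_yes + nonobese_no)   # 28/84

# Relative risk compares risks; odds ratio compares odds.
relative_risk = risk_obese / risk_nonobese
odds_ratio = (obese_yes / obese_no) / (nonobese_yes / nonobese_no)

print(f"RR = {relative_risk:.1f}, OR = {odds_ratio:.1f}")
```

The OR (4.1) is noticeably larger than the RR (2.0) here because the outcome is common in both groups.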


Diabetes complication

Parameter    DF   beta     SE(b)   Chi-Square   p
Intercept    1    -14.70   3.231   20.706       <.0001
obese        1    0.328    0.615   0.285        0.5938
Fast glu     1    0.108    0.031   12.456       0.0004
Steady glu   1    0.023    0.005   18.322       <.0001

Log odds diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu


Statistical sig of the βs

Linear regr: t = b/SE -> p value

Logistic regr: χ2 = (b/SE)^2 -> p value

To get a (95%) CI for the OR, must first form the CI for β on the log scale:

b - 1.96 SE, b + 1.96 SE

Then take antilogs of each end:

e^(b - 1.96 SE), e^(b + 1.96 SE)
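As a sketch, the Wald chi-square and the antilogged confidence limits can be computed from b and its SE. The helper name `wald_summary` is illustrative; the inputs are the rounded "obese" coefficient and SE from the diabetes complication model.

```python
import math

def wald_summary(b, se, z=1.96):
    """Wald chi-square, OR, and 95% CI for the OR from b and its SE."""
    chi_sq = (b / se) ** 2
    lower, upper = b - z * se, b + z * se   # CI for beta on the log scale
    return chi_sq, math.exp(b), math.exp(lower), math.exp(upper)

# "obese" row of the diabetes complication model: b = 0.328, SE = 0.615.
chi_sq, or_hat, or_lo, or_hi = wald_summary(0.328, 0.615)
print(f"chi-square = {chi_sq:.3f}, OR = {or_hat:.3f}, "
      f"95% CI = ({or_lo:.3f}, {or_hi:.3f})")
```

Because the inputs are rounded to three decimals, the results can differ from software output in the last digit or two.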


Diabetes complications Odds Ratio Estimates

Effect       Point Estimate      95% Wald Confidence Limits
obese        e^0.328 = 1.388     0.416    4.631
Fast glu     e^0.108 = 1.114     1.049    1.182
Steady glu   e^0.023 = 1.023     1.012    1.033


Model fit - Linear vs Logistic regression

k variables, n observations

Variation   df     sum of squares or deviance
Model       k      G
Error       n-k    D
Total       n-1    T  <- fixed

Yi = ith observation, Ŷi = prediction for ith obs

statistic            Linear regr     Logistic regr
D/(n-k)              Residual SDe    Mean deviance
Σ[(Yi-Ŷi)/Ŷ]^2       --              Hosmer-L χ2
Corr(Y,Ŷ)^2          R2              Cox-Snell R2
G/T                  R2              Pseudo R2


Good regression models have large G and small D. For logistic regression, D/(n-k), the mean deviance, should be near 1.0.

There are two versions of the R2 for logistic regression.


Goodness of fit: Deviance

Deviance in logistic is like SS in linear regr

             df    -2log L   p value
Model (G)     3    117.21    < 0.001
Error (D)   141     83.46
total (T)   144    200.67

mean deviance = 83.46/141 = 0.59

(want mean deviance to be ≤ 1)

R2 pseudo = G/total = 117/201 = 0.58, R2 cs = 0.554
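The summary statistics on this slide can be recomputed from the deviance table, as a sketch. The Cox-Snell formula used below, 1 - exp(-G/n), is the standard definition written in terms of G and is an addition here, not something stated on the slide. The value n = 145 is taken from the total in the descriptive table.

```python
import math

n = 145              # number of observations
g_model = 117.21     # model deviance G
d_error = 83.46      # error (residual) deviance D
df_error = 141       # error degrees of freedom

total = g_model + d_error              # total deviance T
mean_deviance = d_error / df_error
r2_pseudo = g_model / total
# Cox-Snell R2 expressed through G (standard formula, assumed here).
r2_cox_snell = 1.0 - math.exp(-g_model / n)

print(f"mean deviance = {mean_deviance:.2f}")
print(f"pseudo R2 = {r2_pseudo:.2f}, Cox-Snell R2 = {r2_cox_snell:.3f}")
```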


Goodness of fit: H-L chi sq

Compare observed vs model predicted (expected) frequencies by pred. decile

decile   total   obs y   exp y   obs no   exp no
1        16      0       0.23    16       15.8
2        15      0       0.61    15       14.4
3        15      0       1.31    15       13.7
…
8        16      15      15.6    1        0.40
9        23      23      23.0    0        0.00

chi-square = 9.89, df = 7, p = 0.1946


Goodness of fit vs R2

Interpretation when goodness of fit is acceptable and R2 is poor.

Need to include interactions or make transformation on X variables in model?

Need to obtain more X variables?


Sensitivity & Specificity

Sensitivity=a/(a+c), false neg=c/(a+c)

Specificity=d/(b+d), false pos=b/(b+d)

Accuracy = W sensitivity + (1-W) specificity

               True pos   True neg
Classify pos   a          b
Classify neg   c          d
total          a+c        b+d


Any good classification rule, including a logistic model, should have high sensitivity & specificity. In logistic, we choose a cutpoint, Pc,

Predict positive if P > Pc

Predict negative if P < Pc


Diabetes complication

logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

Pi = 1/(1 + exp(-logit))

Compute Pi for all observations, then find the value of Pi (call it P0) that maximizes

accuracy = 0.5 sensitivity + 0.5 specificity

This is an ROC analysis using the logit (or Pi)
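The cutpoint search can be sketched as follows. The predicted probabilities and outcomes are hypothetical, `best_cutpoint` is an illustrative name, and classifying as positive when P >= cutpoint (with candidate cutpoints taken from the observed probabilities) is an assumption of this sketch.

```python
def best_cutpoint(probs, labels, w=0.5):
    """Return (cutpoint, accuracy) maximizing w*sens + (1-w)*spec.

    Candidate cutpoints are the observed predicted probabilities; an
    observation is classified positive when its P >= cutpoint.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, -1.0)
    for cut in sorted(set(probs)):
        true_pos = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cut)
        true_neg = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cut)
        accuracy = w * true_pos / n_pos + (1 - w) * true_neg / n_neg
        if accuracy > best[1]:   # keep the first (lowest) cutpoint on ties
            best = (cut, accuracy)
    return best

# Hypothetical predicted probabilities and outcomes, for illustration only.
probs = [0.05, 0.20, 0.35, 0.40, 0.65, 0.70, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_cutpoint(probs, labels))
```

Each candidate cutpoint corresponds to one point on the ROC curve, so scanning all of them is the ROC analysis described above.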


[Figure: ROC for logistic model]


Diabetes model accuracy

            True comp   True no comp
Pred yes    55          11
Pred no     14          65
total       69          76

Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%

Accuracy = (79.7% + 85.5%)/2 = 82.6%

Logit = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61
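These quantities can be recomputed from the four cell counts in the table, as a sketch. The function name `sens_spec_accuracy` is illustrative, and `w` is the weight W from the accuracy formula.

```python
import math

def sens_spec_accuracy(a, b, c, d, w=0.5):
    """a, b, c, d follow the classification table layout:
    a = true pos classified pos, b = true neg classified pos,
    c = true pos classified neg, d = true neg classified neg."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    accuracy = w * sensitivity + (1 - w) * specificity
    return sensitivity, specificity, accuracy

sens, spec, acc = sens_spec_accuracy(a=55, b=11, c=14, d=65)
print(f"sens = {sens:.1%}, spec = {spec:.1%}, accuracy = {acc:.1%}")

# Cutpoint on the probability scale from the reported logit of 0.447.
p0 = 1.0 / (1.0 + math.exp(-0.447))
print(f"P0 = {p0:.2f}")
```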


C statistic (report this)

n0=num negative, n1=num positive

Make all n0 x n1 pairs (1,0)

Concordant if

predicted P for Y=1 > predicted P for Y=0

Discordant if

predicted P for Y=1 < predicted P for Y=0

C = (num concordant + 0.5 num ties) / (n0 x n1)

C=0.949 for diabetes complication model
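A direct, brute-force version of the pair counting can be sketched as follows. The data are hypothetical and `c_statistic` is an illustrative name. For large samples a rank-based method would be faster, but the double loop mirrors the definition.

```python
def c_statistic(probs, labels):
    """Share of (Y=1, Y=0) pairs that are concordant; ties count 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = sum(1 for p1 in pos for p0 in neg if p1 > p0)
    ties = sum(1 for p1 in pos for p0 in neg if p1 == p0)
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical predicted probabilities and outcomes, for illustration only.
probs = [0.10, 0.40, 0.40, 0.80, 0.35, 0.90]
labels = [0, 0, 1, 1, 0, 1]

# 3 x 3 = 9 pairs: 8 concordant, 1 tie, 0 discordant.
print(c_statistic(probs, labels))
```

C equals the area under the ROC curve, so 0.5 means no discrimination and 1.0 means perfect separation.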


Logistic model is also a discriminant model (LDA)

[Figure: histograms of logit scores for each group; x-axis: logit(P) from -4 to 4; y-axis: frequency]


Poisson Regression

Y is a low non-negative integer count: 0, 1, 2, …

Model:

ln(mean Y) = β0 + β1X1 + β2X2 + … + βkXk

so

mean Y = exp(β0 + β1X1 + β2X2 + … + βkXk)

d(mean Y)/dXi = βi mean Y, so βi = (d(mean Y)/dXi)/mean Y

100 βi is approximately the percent change in mean Y per unit change in Xi (for small βi; the exact value is 100 (e^βi - 1)).
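The "100 βi" rule of thumb is a small-β approximation to the exact multiplicative change e^βi - 1. A short sketch with hypothetical coefficient values shows how the two diverge as βi grows.

```python
import math

def percent_change(beta):
    """Exact percent change in mean Y per unit increase in X."""
    return 100.0 * (math.exp(beta) - 1.0)

# Hypothetical coefficients, for illustration only.
for beta in (0.01, 0.05, 0.20, 0.50):
    print(f"beta = {beta}: approx {100 * beta:.1f}%, "
          f"exact {percent_change(beta):.1f}%")
```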


End


Equation for logit = log odds = depr “score”

logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income

odds depr = e^logit, risk = odds/(1 + odds)

coding:

Female: 0 for M, 1 for F

Chron ill: 0 for no, 1 for yes

Income in 1000s
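The score equation translates directly into a risk calculation. In this sketch the coefficients are those given above, while the function name `depression_risk` and the example person are hypothetical.

```python
import math

def depression_risk(female, chron_ill, income_thousands):
    """Predicted risk of depression from the fitted logit equation."""
    logit = (-1.8259 + 0.8332 * female + 0.3578 * chron_ill
             - 0.0299 * income_thousands)
    odds = math.exp(logit)
    return odds / (1.0 + odds)

# Hypothetical person: female, no chronic illness, income of 20 (in 1000s).
print(round(depression_risk(female=1, chron_ill=0, income_thousands=20), 3))
```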


Example: Depression (y/n)

Model for depression

term         coeff=β    SE       p value
Intercept    -1.8259    0.4495   0.0001
female        0.8332    0.3882   0.0319
chron ill     0.3578    0.3300   0.2782
income       -0.0299    0.0135   0.0268

Female, chron ill are binary, income in 1000s


ORs

term         coeff=β    OR = e^β
Intercept    -1.8259    ---
female        0.8332    2.301
chron ill     0.3578    1.430
income       -0.0299    0.971
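The OR column is the antilog of each coefficient, which a short sketch can confirm. The 10-unit income example at the end is an illustration added here, applying the (e^β)^k rule from the earlier slide.

```python
import math

# Coefficients from the depression model.
coefs = {"female": 0.8332, "chron ill": 0.3578, "income": -0.0299}

odds_ratios = {term: math.exp(b) for term, b in coefs.items()}
for term, or_value in odds_ratios.items():
    print(f"{term}: OR = {or_value:.3f}")

# OR for a 10-unit (10,000) increase in income: (e^b)^10 = e^(10 b).
or_income_10 = math.exp(10 * coefs["income"])
print(f"income, +10 units: OR = {or_income_10:.3f}")
```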