Modeling Categorical Dependent Variables

What will happen if we have categorical dependent variables?

A Dichotomous Dependent Variable

$$ y_i = \begin{cases} 1 & \text{if Yes} \\ 0 & \text{if No} \end{cases} $$

According to the regression model

$$ y_i = \beta_0 + \beta_1 x_{i1} + e_i $$

$$ \hat{y}_i = E(y_i) = \beta_0 + \beta_1 x_{i1} $$

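How Do Choice Probabilities Fit In?

We define

$$ p_{i1} = \Pr[y_i = 1] = \Pr[\text{Yes}] $$

$$ p_{i2} = \Pr[y_i = 0] = \Pr[\text{No}] $$

From the definition of the expectation of a discrete variable,

$$ E(y_i) = (1)\,p_{i1} + (0)\,p_{i2} = p_{i1} $$

so that

$$ p_{i1} = \beta_0 + \beta_1 x_{i1} $$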

Two Requirements for a Probability

Logical Consistency:

$$ 0 \le p_{i1} \le 1 $$

Sum Constraint:

$$ p_{i1} + p_{i2} = 1 $$

A Requirement for Regression

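Gauss-Markov Assumption:

$$ V(e) = \sigma^2 I $$

Two possibilities exist:

$$ e_i = \begin{cases} 1 - (\beta_0 + \beta_1 x_{i1}) & \text{for } y_i = 1 \\ -(\beta_0 + \beta_1 x_{i1}) & \text{for } y_i = 0 \end{cases} $$

Since $E(e_i) = 0$, and by the definition of $E(\cdot)$,

$$ V(e_i) = E[e_i - E(e_i)]^2 = E(e_i^2) = p_{i1}(1 - \beta_0 - \beta_1 x_{i1})^2 + p_{i2}(-\beta_0 - \beta_1 x_{i1})^2 $$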

Heteroskedasticity Rears Its Head

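$$ E(e_i^2) = p_{i1}(1 - \beta_0 - \beta_1 x_{i1})^2 + p_{i2}(-\beta_0 - \beta_1 x_{i1})^2 $$

$$ = p_{i1}(1 - p_{i1})^2 + (1 - p_{i1})(-p_{i1})^2 $$

$$ = p_{i1}(1 - p_{i1}) $$

$$ = (\beta_0 + \beta_1 x_{i1})(1 - \beta_0 - \beta_1 x_{i1}) $$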

Note that the subscript i appears on the right hand side!
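A minimal numerical sketch of this point, assuming hypothetical coefficients (the values of BETA0 and BETA1 below are illustrative and not taken from the slides). It evaluates the linear model's fitted probability and the implied error variance p(1 - p) at several values of x, showing that the variance changes with x and that the fitted "probability" can leave the [0, 1] interval.

```python
# Hypothetical linear probability model coefficients (illustrative only).
BETA0, BETA1 = -0.2, 0.02


def lpm_probability(x):
    """Fitted 'probability' under the linear model p = b0 + b1 * x."""
    return BETA0 + BETA1 * x


def error_variance(x):
    """V(e_i) = p_i1 * (1 - p_i1), which depends on x through p_i1."""
    p = lpm_probability(x)
    return p * (1.0 - p)


for x in (20, 35, 50, 70):
    # At x = 70 the fitted value exceeds 1 and the implied variance is negative.
    print(x, round(lpm_probability(x), 3), round(error_variance(x), 4))
```

The last row is the troublesome one: a fitted value above 1 is not a probability, which is one motivation for the logit model that follows.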

The Logit Model

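$$ p_{i1} = F_L(\beta_0 + \beta_1 x_{i1}) = \frac{e^{\beta_0 + \beta_1 x_{i1}}}{1 + e^{\beta_0 + \beta_1 x_{i1}}} $$

[Figure: logistic curve; vertical axis from 0.0 to 1.0, horizontal axis $x_i$]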

Example

Age and Cash Discount Approval

Age  CD     Age  CD     Age  CD
22   0      40   0      54   0
23   0      41   1      55   1
24   0      46   0      58   1
27   0      47   0      60   1
28   0      48   0      60   0
30   0      49   1      62   1
30   0      49   0      65   1
32   0      50   1      67   1
33   0      51   0      71   1
35   1      51   1      77   1
38   0      52   0      81   1

How can we analyse these data?

Compare mean age of people having cash discount and non-CD

Non-CD: 38.6 years
CD: 58.7 years (p

[Figure: dot plot of Cash Discount (No / Yes) against AGE (years), horizontal axis from 0 to 100]

Logistic regression

Prevalence (%) of CD according to age group

Age group   # in group   # with CD   % with CD
20 - 29     5            0           0
30 - 39     6            1           17
40 - 49     7            2           29
50 - 59     7            4           57
60 - 69     5            4           80
70 - 79     2            2           100
80 - 89     1            1           100
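The two summaries above (mean age by CD status and prevalence by age group) can be reproduced from the 33 observations. The sketch below is one way to do it; the variable and function names are illustrative, and the data are transcribed from the Age / CD table.

```python
# (age, CD) pairs transcribed from the Age / CD table.
DATA = [
    (22, 0), (23, 0), (24, 0), (27, 0), (28, 0), (30, 0), (30, 0), (32, 0),
    (33, 0), (35, 1), (38, 0), (40, 0), (41, 1), (46, 0), (47, 0), (48, 0),
    (49, 1), (49, 0), (50, 1), (51, 0), (51, 1), (52, 0), (54, 0), (55, 1),
    (58, 1), (60, 1), (60, 0), (62, 1), (65, 1), (67, 1), (71, 1), (77, 1),
    (81, 1),
]


def mean_age(cd_flag):
    """Mean age of the observations whose CD indicator equals cd_flag."""
    ages = [age for age, cd in DATA if cd == cd_flag]
    return sum(ages) / len(ages)


# Group by decade: decade -> [number in group, number with CD]
groups = {}
for age, cd in DATA:
    decade = (age // 10) * 10
    counts = groups.setdefault(decade, [0, 0])
    counts[0] += 1
    counts[1] += cd

# decade -> (number in group, number with CD, percent with CD)
prevalence = {d: (n, k, round(100 * k / n)) for d, (n, k) in groups.items()}

print(round(mean_age(0), 1), round(mean_age(1), 1))
for decade in sorted(prevalence):
    print(decade, prevalence[decade])
```

The grouped percentages rise steadily with age, which is the S-shaped pattern the logistic function is designed to capture.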

[Figure: dot plot of CD % (0 to 100) against age group]

Logistic function

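[Figure: logistic curve; vertical axis: probability of CD (0.0 to 1.0), horizontal axis: x]

$$ P(y|x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} $$

Logistic transformation (the logit of P(y|x)):

$$ \ln\left[\frac{P(y|x)}{1 - P(y|x)}\right] = \alpha + \beta x $$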

The Expression for Not Buying

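$$ p_{i1} = \frac{e^{u_i}}{1 + e^{u_i}} $$

$$ p_{i2} = 1 - p_{i1} = \frac{1 + e^{u_i}}{1 + e^{u_i}} - \frac{e^{u_i}}{1 + e^{u_i}} = \frac{1}{1 + e^{u_i}} $$

where $u_i = \beta_0 + \beta_1 x_{i1}$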
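As a quick check of the two expressions, the sketch below evaluates both probabilities for an arbitrary value of u (the value 0.8 is hypothetical) and confirms that they sum to one and that the log of their ratio recovers u.

```python
import math


def logistic(u):
    """p_i1 = e^u / (1 + e^u): probability of the 'Yes' outcome."""
    return math.exp(u) / (1.0 + math.exp(u))


def logit(p):
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


u = 0.8                          # hypothetical value of b0 + b1 * x
p1 = logistic(u)                 # probability of buying
p2 = 1.0 / (1.0 + math.exp(u))   # probability of not buying

print(p1, p2, p1 + p2, logit(p1))
```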

The Logit Is a Special Case of Bell, Keeney and Little's (1975) Market Share Theorem

$$ p_{ij} = \frac{a_{ij}}{\sum_{m=1}^{J} a_{im}} $$

For J = 2 and the logit model, $a_{i1} = e^{u_i}$ and $a_{i2} = 1$.

My Share of the Market Is My Share of the Attraction

$$ \Pr(1) = \frac{a_1}{a_1 + a_2} $$

where a1 is a function of Marketing Variables brought to bear on behalf of brand 1

The Model Can Be Linearized for Least Squares

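$$ p_{i1} = \frac{e^{u_i}}{1 + e^{u_i}} $$

$$ \frac{p_{i1}}{p_{i2}} = \frac{e^{u_i} / (1 + e^{u_i})}{1 / (1 + e^{u_i})} = e^{u_i} $$

$$ \ln\left(\frac{p_{i1}}{p_{i2}}\right) = u_i = \beta_0 + \beta_1 x_{i1} $$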

Odds and Odds Ratios

- Odds is the probability of an event occurring divided by the probability of the event not occurring
- An odds ratio is the ratio of the odds for two different groups
- An odds ratio = 1 implies equal risk in the two groups

Probability and Odds

- We begin with a frequency distribution for the variable "Buying insurance due to Mortality Risk"
- The probability of buying insurance to cover MR is 0.34 or 34% (50/147)
- The odds of buying insurance due to MR = MR/Inv = 50/97 = 0.5155

Reason                 Count   Percent
Mortality Risk (MR)    50      34%
Investment (Inv)       97      66%
Total                  147     100%

Interpreting Odds

The odds of 0.5155 can be stated in different ways:

- Insurance companies can expect to win roughly one customer through a good MR scheme for every two customers won through a good investment-return scheme
- Winning a customer with a good MR policy is about half as likely as winning one with a good investment-return policy
- Or, inverting the odds, winning a customer with a good investment-return policy is about twice as likely as winning one with a good MR policy
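The arithmetic behind these statements is small enough to sketch directly. The function name below is illustrative; the counts are the 50 MR and 97 Inv customers from the frequency table.

```python
def probability_and_odds(event_count, other_count):
    """Return (probability, odds) of the event from two frequency counts."""
    probability = event_count / (event_count + other_count)
    odds = event_count / other_count
    return probability, odds


prob_mr, odds_mr = probability_and_odds(50, 97)

print(round(prob_mr, 2))        # probability of buying for MR
print(round(odds_mr, 4))        # odds of MR versus Inv
print(round(1 / odds_mr, 2))    # inverted odds: Inv versus MR
```

Odds and probability carry the same information: odds / (1 + odds) gives back the probability.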

Impact of an Independent Variable

If an independent variable impacts or has a relationship to a dependent variable, it will change the odds of being in the key dependent variable group, e.g. buying the insurance policy.

The following table shows the relationship between age and buying behaviour:

                       Age < 40   Age >= 40   Total
Mortality Risk (MR)    28         22          50
Investment (Inv)       45         52          97
Total                  73         74          147

Odds for Independent Variable Groups

We can compute the odds of buying an insurance policy for each of the groups:

The odds of buying an MR policy for customers aged < 40 = 28/45 = 0.6222

The odds of buying an MR policy for customers aged >= 40 = 22/52 = 0.4231


The Odds Ratio Measures the Effect

The impact of age on buying an insurance policy is measured by the odds ratio, which equals:

odds ratio = (odds if age < 40) / (odds if age >= 40) = 0.6222 / 0.4231 = 1.47

Which we interpret as:

- The odds of buying an MR policy for young customers are 1.47 times the odds for old customers
- The odds of buying an MR policy for young customers are 47% higher than the odds for old customers (1.47 - 1.00)
- A one-unit change in the independent variable age (old to young) increases the odds of buying an MR policy by a factor of 1.47

Odds & Odds ratios

               Bankruptcy
Delinquency    Yes    No     Total
Yes            75     175    ?
No             20     180    ?
Total          ?      ?      ?

Delinquency    P(ibk)   P(nibk)   Odds ibk   Odds ratio
Yes            0.3      0.7       0.43       3.86
No             0.1      0.9       0.11
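Both odds ratios in this part of the deck (1.47 for age and insurance, 3.86 for delinquency and bankruptcy) come from the same calculation on a 2x2 table. A short sketch, with an illustrative function name:

```python
def odds_ratio(event_1, nonevent_1, event_2, nonevent_2):
    """Odds of the event in group 1 divided by the odds in group 2."""
    odds_1 = event_1 / nonevent_1
    odds_2 = event_2 / nonevent_2
    return odds_1 / odds_2


# Age table: MR vs Inv for age < 40 (group 1) and age >= 40 (group 2)
or_age = odds_ratio(28, 45, 22, 52)

# Bankruptcy table: bankrupt vs not, for delinquent (group 1) and
# non-delinquent (group 2) customers
or_bankruptcy = odds_ratio(75, 175, 20, 180)

print(round(or_age, 2), round(or_bankruptcy, 2))
```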

Advantages of Logit Transform

- Probabilities range between zero and one
- Odds = P/(1-P)
- Odds range between zero and infinity
- Logit = ln(P/(1-P))
- The logit transform ranges between negative infinity and infinity

Logistic Regression

- Model the logarithm of the odds of an outcome as a linear combination of predictor variables: Logit = ln(P/(1-P)) = b0 + b1X1 + b2X2 + ...
- Estimate the coefficients b0, b1, b2, ... from a random sample of subjects' data
- Determine which of the predictors are good
- Assess model fit
- Use the model to predict future cases

[Figure: two panels, Logit against Age and Probability against Age]

Estimating & Interpreting Logistic Regression

Logistic Regression Model

The general model for Logistic Regression is

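$$ \ln\left[\frac{U(x)}{1 - U(x)}\right] = \beta_1 R + \beta_2 x + \beta_3 xR $$

Re-write to define U(x)

$$ U(x) = \frac{\exp(\beta_1 R + \beta_2 x + \beta_3 xR)}{1 + \exp(\beta_1 R + \beta_2 x + \beta_3 xR)} $$

Terms

Term                              Definition
U(x)                              Logistic Regression Function
R                                 Categorical Variable
x                                 Continuous Variable
$\beta_1, \beta_2, \beta_3$       Parameters to be estimated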

Properties of Logistic Regression

- The dependent variable takes on a value of 0 or 1
- Therefore, Pr(Y = 1 | x) = U(x)
- Y is transformed into odds, i.e., the probability an event occurs relative to its converse
- Odds of 1.0 indicate equal probability of an event and its converse (p = 0.5)
- The natural log of the odds is the logit transformation

How to Estimate the Parameters

Logistic Regression Uses Maximum Likelihood Estimation

Maximum Likelihood Estimates

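Fit the Likelihood Function

$$ L = \prod_{i=1}^{n} U(x_i)^{u_i} \, [1 - U(x_i)]^{1 - u_i} $$

The first factor is the probability an event occurred; the second is the probability an event did not occur. Here $u_i$ is the observed 0/1 outcome for observation i.

$$ L = \prod_{i=1}^{n} \left[\frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\right]^{u_i} \left[\frac{1}{1 + \exp(\beta_0 + \beta_1 x_i)}\right]^{1 - u_i} $$

We seek those values of $\beta_0$ and $\beta_1$ that maximize the likelihood function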
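One common way to carry out this maximization is Newton-Raphson on the log-likelihood. The sketch below applies it to the Age / CD example from earlier in the deck. It is an illustration under stated assumptions, not the procedure any particular package uses: the function name, the centering of age, the iteration limit and the tolerance are all choices made here.

```python
import math

# Age and CD indicator, transcribed from the Age / CD table.
AGES = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38, 40, 41, 46, 47, 48, 49,
        49, 50, 51, 51, 52, 54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
CD = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
      0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logit(x, y, iterations=25):
    """Maximize the one-predictor logit likelihood by Newton-Raphson."""
    mean_x = sum(x) / len(x)
    xc = [v - mean_x for v in x]          # centering helps numerical stability
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * v))) for v in xc]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * v for yi, pi, v in zip(y, p, xc))
        # Information matrix (negative Hessian), 2 x 2
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * v for wi, v in zip(w, xc))
        h11 = sum(wi * v * v for wi, v in zip(w, xc))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = gradient
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < 1e-10:
            break
    # Undo the centering so the intercept refers to raw age
    return b0 - b1 * mean_x, b1


def predict(b0, b1, x):
    """Fitted probability of CD at age x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


beta0, beta1 = fit_logit(AGES, CD)
fitted = [predict(beta0, beta1, a) for a in AGES]
print(beta0, beta1)
```

At the maximum the fitted probabilities sum to the observed number of events, a standard property of the logit likelihood with an intercept, which makes a convenient convergence check.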

Interpreting Results

How to interpret the results?

- A 1-unit change in X is associated with a b-unit change in the value of the latent preference variable (Y*), which determines the (dichotomous) value of the observed variable (Y)

Alternative interpretations?

- P/(1-P) = exp{a+bX} is the ODDS of Y=1, the ratio of two mutually exclusive probabilities
- When X changes (from X0 to X1), the odds change by the factor:

$$ \frac{\exp\{a + bX_1\}}{\exp\{a + bX_0\}} = \exp\{a + bX_1 - a - bX_0\} = \exp\{b(X_1 - X_0)\} $$

- A 1-unit change in X (= X1 - X0) is associated with a b-unit change in the log of the ODDS (logit) for Y=1
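A small numerical check of this interpretation, assuming hypothetical coefficients (a = -3.0 and b = 0.1 are illustrative, not estimates from the deck): the odds at two values of X differ by the factor exp{b(X1 - X0)}, regardless of the starting point.

```python
import math


def odds(a, b, x):
    """Odds of Y=1 under the logit model: P/(1-P) = exp(a + b*x)."""
    return math.exp(a + b * x)


a, b = -3.0, 0.1                  # hypothetical coefficients

one_unit = odds(a, b, 51) / odds(a, b, 50)       # X0 = 50, X1 = 51
same_step_elsewhere = odds(a, b, 21) / odds(a, b, 20)
ten_units = odds(a, b, 60) / odds(a, b, 50)      # X1 - X0 = 10

print(one_unit, same_step_elsewhere, ten_units)
```

The one-unit factor equals exp(b) at any starting value of X, which is why exp(b) is the quantity usually reported as the odds ratio for a predictor.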

But what about a change in PROBABILITY??? (marginal effect)

What is the marginal effect on the PROBABILITY?

- Logit is a non-linear relationship, thus b is not an independent effect

- The effect of dX on P(Y) is a function of Z=a+bX (depends on all Xs)

- Thus, the effect of dX has to be evaluated at each value of Z

What can we infer about how the probability changes?

- The b-coefficient shows the sign (direction) of the relationship, but does not by itself determine the magnitude of the effect; the size of the effect depends on the values of all the model parameters

- Hence, logit allows a specification where changes in the observed behaviour (outcome) that are due to the change in an exogenous variable (characteristic) are conditioned on the values & effects of all other characteristics

- However, the underlying relationship (effect of each characteristic on preference over observed outcome) is still linear (i.e., impact of each characteristic on preferences is independent of other characteristics)
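For the logit model the marginal effect has a closed form, dP/dX = b P (1 - P), which makes the dependence on Z = a + bX explicit. The sketch below, using hypothetical coefficients (a = -3.0, b = 0.1), compares that formula with a finite-difference derivative and shows that the same b produces different probability changes at different values of X.

```python
import math


def prob(a, b, x):
    """P(Y=1) under the logit model."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def marginal_effect(a, b, x):
    """Analytic dP/dX = b * P * (1 - P), evaluated at x."""
    p = prob(a, b, x)
    return b * p * (1.0 - p)


def numeric_derivative(a, b, x, h=1e-6):
    """Central finite difference, used only to check the formula."""
    return (prob(a, b, x + h) - prob(a, b, x - h)) / (2.0 * h)


a, b = -3.0, 0.1                  # hypothetical coefficients

for x in (10, 30, 60):
    print(x, round(prob(a, b, x), 3), round(marginal_effect(a, b, x), 4))
```

The effect is largest where P = 0.5 (here at X = 30, where it equals b/4) and shrinks toward zero as P approaches 0 or 1.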

Case Study

Predicting default behavior

Data

- Customer-level information
- Credit quality
- Interest rate premium

Sno        CS     Premium   Default
Cust 1     637    0.63      0
Cust 2     653    1.68      1
Cust 3     556    0.63      0
Cust 4     664    1.905     0
Cust 5     544    1.98      0
Cust 6     632    0.78      0
Cust 7     595    2.63      0
Cust 8     557    3.83      0
Cust 9     651    2.38      0
Cust 10    666    0.73      0
Cust 11    686    0.23      0
Cust 12    712    0.88      0

[Figure: Default values by CS; Default (0 to 1.2) against CS (400 to 850)]

[Figure: Default values by Premium; Default (0 to 1.2) against Premium (-2 to 10)]

[Figure: Default Rate by CS; Default Rate (0% to 35%) for CS bins < 560, 560 -< 620, 620 -< 700, 700+]

[Figure: Default Rate by Premium bins; Default Rate (0% to 35%) for Premium bins < 1, 1 -< 2, 2 -< 3, 3+]

Parameter   DF   Estimate   Std Err    Wald ChiSq   Pr > ChiSq
Intercept   1    -1.7501    0.0518     1140.3188

Parameter   DF   Estimate   Std Err    Wald ChiSq   Pr > ChiSq
Intercept   1    3.6161     0.4638     60.7754

Parameter   DF   Estimate   Std Err    Wald ChiSq   Pr > ChiSq
Intercept   1    -0.0448    0.561      0.0064       0.9363
CS          1    -0.00466   0.000835   31.1981

[Figure: Bads and Goods separated by a Cutoff; % Default axis with 10% and 35% marked]

Computing Cutoff

Logistic Regression Predicted Probabilities and Classification with 0.30 cutoff

Sno        CS     Premium   Actual Default   Logistic P   Pred Default   Classify
Cust 1     637    0.63      0                0.01155      0              ND
Cust 2     653    1.68      1                0.95213      1              D
Cust 3     556    0.63      0                0.89124      1              D
Cust 4     664    1.905     0                0.84625      1              D
Cust 5     544    1.98      0                0.67182      1              D
Cust 6     632    0.78      0                0.78328      1              D
Cust 7     595    2.63      0                0.61989      1              D
Cust 8     557    3.83      0                0.00001      0              ND
Cust 9     651    2.38      0                0.95435      1              D
Cust 10    666    0.73      0                0.85686      1              D
Cust 11    686    0.23      0                0.83464      1              D
Cust 12    712    0.88      0                0.65759      1              D
Cust 13    632    2.73      0                0.17796      0              ND
Cust 14    619    5.32      0                0.36792      1              D
Cust 15    664    1.13      0                0.23750      0              ND
Cust 16    575    2.47      0                0.12322      0              ND
Cust 17    750    0.92      1                0.11146      0              ND
Cust 18    645    3.13      0                0.05473      0              ND
Cust 19    644    2.97      1                0.03869      0              ND
Cust 20    678    1.17      0                0.03869      0              ND

Sensitivity & Specificity

- Sensitivity: power to identify positives. Sensitivity = TP / (TP + FN)
- Specificity: power to identify negatives. Specificity = TN / (TN + FP)

           Reality
Model      P      N
P          TP     FP
N          FN     TN
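Applying these definitions to the 20 customers in the classification listing (0.30 cutoff) gives a concrete confusion matrix. The sketch below is illustrative: the names are chosen here, and the (actual default, predicted probability) pairs are transcribed from that listing.

```python
# (actual default, logistic predicted probability) for Cust 1 .. Cust 20
CUSTOMERS = [
    (0, 0.01155), (1, 0.95213), (0, 0.89124), (0, 0.84625), (0, 0.67182),
    (0, 0.78328), (0, 0.61989), (0, 0.00001), (0, 0.95435), (0, 0.85686),
    (0, 0.83464), (0, 0.65759), (0, 0.17796), (0, 0.36792), (0, 0.23750),
    (0, 0.12322), (1, 0.11146), (0, 0.05473), (1, 0.03869), (0, 0.03869),
]
CUTOFF = 0.30


def confusion_counts(rows, cutoff):
    """Return (TP, FP, TN, FN), predicting default when p >= cutoff."""
    tp = fp = tn = fn = 0
    for actual, p in rows:
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(CUSTOMERS, CUTOFF)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, fp, tn, fn, round(sensitivity, 3), round(specificity, 3))
```

On this small listing the model catches one of the three actual defaults and flags ten non-defaulters, so both measures are low at this cutoff.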

[Figure: Sensitivity and Specificity (0.00 to 1.00) against Probability cutoff (0.00 to 1.00)]


Logistic regression for binary response variables

Basic Syntax:

    /* Fit the model; save the estimates to PARMS and the ROC data to ROC1 */
    proc logistic data=two outest=parms descending;
      class x3 (ref='1') c4 (ref='F') / param=ref;
      model y = x1 x2 x3 c4 / rsquare lackfit selection=stepwise
                              ctable pprob=(0 to 1 by 0.1) outroc=roc1;
    run;

    /* Score a data set with the saved parameter estimates */
    proc score data=chdage1 score=parms out=scored type=parms;
      var age;
    run;

- rsquare requests a generalized R2 measure for the fitted model.
- lackfit performs the Hosmer and Lemeshow goodness-of-fit test.
- ctable classifies the input response observations according to whether the predicted probability of (Y=1) is above or below some cutpoint value, for a number of cutpoint values in the range (0,1). An observation is predicted as an event (in our case, 1) if the predicted probability of (Y=1) exceeds the cutpoint value. The table helps assess the ability of the model to discriminate between the two groups of cases, Y=1 and Y=0.


Classification Table: The model classifies an observation as an event if its estimated probability is greater than or equal to a given probability cutpoint.

              Correct             Incorrect           Percentages (%)
Prob. Level   Event   Non Event   Event   Non Event   Correct   Sensitivity   Specificity   FALSE POS   FALSE NEG
0             57      0           43      0           57        100           0             43          .
0.1           57      1           42      0           58        100           2.3           42.4        0
0.2           55      7           36      2           62        96.5          16.3          39.6        22.2
0.3           51      19          24      6           70        89.5          44.2          32          24
0.4           50      25          18      7           75        87.7          58.1          26.5        21.9
0.5           45      27          16      12          72        78.9          62.8          26.2        30.8
0.6           41      32          11      16          73        71.9          74.4          21.2        33.3
0.7           32      36          7       25          68        56.1          83.7          17.9        41
0.8           24      39          4       33          63        42.1          90.7          14.3        45.8
0.9           6       42          1       51          48        10.5          97.7          14.3        54.8
1             0       43          0       57          43        0             100           .           57

Item          a       b           c       d           (a+b) / (a+b+c+d)   a / (a+d)   b / (b+c)   c / (a+c)   d / (b+d)

- Correct: Tot Correct / Total
- Sensitivity: Correct Event / Tot Event
- Specificity: Correct N.Event / Tot N.Event
- FALSE POS: F.Pos / (F.Pos + Pos)
- FALSE NEG: F.Neg / (F.Neg + Neg)
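The percentage columns follow directly from the four counts a, b, c, d. The sketch below recomputes them for a given row; the function name and the use of None for an undefined ratio (printed as "." in the table) are choices made for this illustration.

```python
def classification_rates(a, b, c, d):
    """Percentages from one row of the classification table.

    a: correct events, b: correct non-events,
    c: incorrect events (false positives), d: incorrect non-events
    (false negatives). A ratio with a zero denominator is returned as None.
    """
    def pct(numerator, denominator):
        if denominator == 0:
            return None
        return round(100.0 * numerator / denominator, 1)

    return {
        "correct": pct(a + b, a + b + c + d),
        "sensitivity": pct(a, a + d),
        "specificity": pct(b, b + c),
        "false_pos": pct(c, a + c),
        "false_neg": pct(d, b + d),
    }


row_050 = classification_rates(45, 27, 16, 12)   # Prob. Level 0.5
row_000 = classification_rates(57, 0, 43, 0)     # Prob. Level 0
print(row_050)
print(row_000)
```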


Interpretation of SAS output - continued

Model Selection Criteria:

- Convergence: the difference in parameter estimates between iterations is small enough.

Model Fit Statistics Criteria (k parameters, N observations):

- -2 * log(likelihood)
- AIC = -2 * log(max likelihood) + 2 * k
- SIC = -2 * log(max likelihood) + log(N) * k

Likelihood Function:

$$ L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} $$

Testing Global Null Hypothesis: BETA=0

- Likelihood ratio: -2 * [ln(L intercept) - ln(L intercept + covariates)]
- Score: based on the 1st and 2nd derivatives of log(L)
- Wald: (coefficient / std error)^2
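These criteria are simple functions of the maximized log-likelihood, so a short sketch may help fix the signs. The function names are illustrative and the log-likelihood values in the usage lines are hypothetical; only the Wald example uses numbers reported in the deck (the intercept estimate -0.0448 with standard error 0.561).

```python
import math


def aic(max_log_likelihood, k):
    """AIC = -2 * log(max likelihood) + 2 * k."""
    return -2.0 * max_log_likelihood + 2.0 * k


def sic(max_log_likelihood, k, n):
    """SIC = -2 * log(max likelihood) + log(N) * k."""
    return -2.0 * max_log_likelihood + math.log(n) * k


def likelihood_ratio(loglik_intercept_only, loglik_full):
    """LR statistic = -2 * [ln L(intercept) - ln L(intercept + covariates)]."""
    return -2.0 * (loglik_intercept_only - loglik_full)


def wald(coefficient, std_error):
    """Wald chi-square = (coefficient / std error)^2."""
    return (coefficient / std_error) ** 2


# Hypothetical log-likelihoods, for illustration only
print(aic(-10.0, 2), sic(-10.0, 2, 100), likelihood_ratio(-15.0, -10.0))

# Reported intercept row: estimate -0.0448, std error 0.561
print(round(wald(-0.0448, 0.561), 4))
```

The Wald value computed from the rounded estimate and standard error agrees with the 0.0064 printed in the parameter table.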


Interpretation of SAS output - continued

Analysis of Maximum Likelihood Estimates

- Parameter estimates and significance tests

Odds Ratio Estimates

- Odds:

$$ O_i = \exp\left(\beta_0 + \sum_{j}^{k} \beta_j x_{ij}\right) $$

- Odds ratio: Oi / Oj per unit change in covariate.

Association of Predicted Probabilities and Observed Responses

- Pairs: 43 (event) * 57 (non event) = 2451
- Concordant: the 0 has the lower predicted probability and the 1 has the higher
- Discordant: the 0 has the higher predicted probability and the 1 has the lower
- Tie: all other pairs

ROC is used to visualize the model's prediction strength.
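The pair counting can be sketched on the 20-customer listing from the case study (3 events, 17 non-events, so 51 pairs). This is an illustration only: it compares the printed probabilities exactly, whereas statistical packages may group nearly equal probabilities before declaring a tie, and the names used are chosen here.

```python
# Predicted probabilities from the 20-customer listing, split by actual outcome
EVENT_PROBS = [0.95213, 0.11146, 0.03869]            # Cust 2, 17, 19
NONEVENT_PROBS = [
    0.01155, 0.89124, 0.84625, 0.67182, 0.78328, 0.61989, 0.00001, 0.95435,
    0.85686, 0.83464, 0.65759, 0.17796, 0.36792, 0.23750, 0.12322, 0.05473,
    0.03869,
]


def pair_counts(event_probs, nonevent_probs):
    """Count concordant, discordant and tied (event, non-event) pairs."""
    concordant = discordant = tied = 0
    for pe in event_probs:
        for pn in nonevent_probs:
            if pe > pn:
                concordant += 1      # the 1 has the higher probability
            elif pe < pn:
                discordant += 1      # the 0 has the higher probability
            else:
                tied += 1
    return concordant, discordant, tied


concordant, discordant, tied = pair_counts(EVENT_PROBS, NONEVENT_PROBS)
pairs = len(EVENT_PROBS) * len(NONEVENT_PROBS)
# Area under the ROC curve: concordant pairs plus half the ties, over all pairs
c_statistic = (concordant + 0.5 * tied) / pairs
print(concordant, discordant, tied, round(c_statistic, 3))
```

A c statistic below 0.5, as here, means the listed probabilities rank actual defaulters below non-defaulters more often than not.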

