
Comparing MLE, MUE and Firth Estimates for Logistic Regression

Nitin R Patel, Chairman & Co-founder, Cytel Inc.

Research Affiliate, MIT

[email protected]

Copyright © Cytel Inc. 2013. All rights reserved.

Acknowledgements

This presentation is based on joint work with:

•  Pralay Senchaudhuri, Cytel Inc.
•  Hrishikesh Kulkarni, Cytel Inc.

JSM, Montreal, Aug 2013


Outline

•  Separation and Maximum Likelihood Estimates
•  Firth’s Method of Maximum Penalized Likelihood Estimation
•  Numerical experiments comparing MUE with FirthE when there is separation
•  Near separation and problems with MLE
•  Numerical experiments comparing MLE with FirthE when there is near separation
•  Conclusions



Maximum Likelihood Estimation

•  Almost universally used method for logistic regression models.
•  ML estimates are asymptotically unbiased and asymptotically attain minimum variance, but these properties need not hold in finite samples.
•  MLEs can have serious shortcomings when applied to datasets with the following characteristics:
   –  Small or moderate in size
   –  Unbalanced responses (rare outcomes)
   –  Unequally spaced covariate values
   –  Many parameters relative to the number of observations



Separation

•  Example 1

seq#   x1   x2   y
   1   10   10   1
   2   11   11   1
   3   12   12   1
   4   13   13   1
   5   14   14   1
   6   15   15   1
   7   16   16   1
   8   17   17   1
   9   19   19   1
  10   10   16   0
  11   11   17   0
  12   12   18   0
  13   13   19   0
  14   14   20   0
  15   15   21   0
  16   16   22   0
  17   17   23   0
  18   18   18   0
  19   18   24   0
  20   19   25   0

[Figure: covariate plot of data; horizontal axis x1 (5 to 20), vertical axis x2 (5 to 30)]
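The separation in Example 1 can be checked by hand along a single direction. The sketch below is illustrative only: the candidate direction x2 - x1 is an assumption chosen by inspecting the table, the function name `separation_along` is hypothetical, and the labels "complete" and "quasi-complete" are standard terminology rather than wording from the slides. It tests one direction and is not a general separation detector.

```python
# Example 1 data as (x1, x2, y), transcribed from the table on the slide.
DATA = [
    (10, 10, 1), (11, 11, 1), (12, 12, 1), (13, 13, 1), (14, 14, 1),
    (15, 15, 1), (16, 16, 1), (17, 17, 1), (19, 19, 1),
    (10, 16, 0), (11, 17, 0), (12, 18, 0), (13, 19, 0), (14, 20, 0),
    (15, 21, 0), (16, 22, 0), (17, 23, 0), (18, 18, 0), (18, 24, 0),
    (19, 25, 0),
]


def separation_along(data, direction):
    """Classify separation of y=1 from y=0 along one direction (a1, a2)."""
    a1, a2 = direction
    ones = [a1 * x1 + a2 * x2 for x1, x2, y in data if y == 1]
    zeros = [a1 * x1 + a2 * x2 for x1, x2, y in data if y == 0]
    # A direction along which every point scores the same carries no information.
    if max(ones + zeros) == min(ones + zeros):
        return "none along this direction"
    # Strict gap between the two groups: complete separation.
    if max(ones) < min(zeros) or min(ones) > max(zeros):
        return "complete"
    # Groups touch but do not cross: quasi-complete separation.
    if max(ones) <= min(zeros) or min(ones) >= max(zeros):
        return "quasi-complete"
    return "none along this direction"


# Candidate direction x2 - x1 (an assumption chosen by inspecting the data).
result = separation_along(DATA, (-1, 1))
print(result)
```

Along x2 - x1 every y = 1 point scores 0 and every y = 0 point scores 0 or 6, so the two groups touch (observation 18) but never cross.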



MLEs and Separation

•  When separation occurs, one or more MLEs do not exist.

•  In other words, one or more MLEs are unbounded (and so are their standard errors).

•  This means that the maximum likelihood method fails to provide either point or interval estimates.




A useful characterization of separation

•  Separation occurs if and only if the observed vector of sufficient statistics is on the boundary of the convex hull of the (finite) set of possible sufficient statistics vectors.



Example 2: Simple Logistic Regression (one covariate, two parameters)

Response $Y_i$, covariate $x_i$ for observation $i$.

Model: $\operatorname{logit}(\pi_i) = \beta_0 + \beta_1 x_i$, where $\pi_i = P(Y_i = 1)$.

Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.


Example 2: Simple Logistic Regression (contd.)

Covariate values x: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100

[Figure: possible sufficient statistics vectors; horizontal axis t0: sufficient stat. for beta0 (0 to 25), vertical axis t1: sufficient stat. for beta1 (0 to 1200)]

Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$.



Example 2 (contd.)

•  If we observe:
   yi = 0 for xi = 5, 10, 15, 20, 25, 30, 35, 40, 45
   yi = 1 for xi = 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100.
•  The observed sufficient statistics vector is (t0 = 11, t1 = 825).
•  The MLE for β1 does not exist since (11, 825) is on the boundary of the convex hull of the possible (T0, T1) values.
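The arithmetic behind (t0 = 11, t1 = 825) can be reproduced in a few lines. This is a minimal sketch, not code from the presentation; the variable names are illustrative. It relies on one observation: for a fixed t0, the smallest and largest attainable t1 are the sums of the t0 smallest and t0 largest covariate values, and it is these extreme points that lie on the boundary.

```python
# Example 2 design: x = 5, 10, ..., 100 with y = 0 for x <= 45 and y = 1 otherwise.
xs = list(range(5, 101, 5))
ys = [0 if x <= 45 else 1 for x in xs]

# Observed sufficient statistics.
t0 = sum(ys)
t1 = sum(x * y for x, y in zip(xs, ys))

# Extreme values of t1 among all response vectors with the same t0.
ordered = sorted(xs)
t1_min = sum(ordered[:t0])                  # the t0 smallest x values
t1_max = sum(ordered[len(ordered) - t0:])   # the t0 largest x values

on_boundary = t1 in (t1_min, t1_max)
print(t0, t1, t1_min, t1_max, on_boundary)
```

Here t1 coincides with the largest value attainable for t0 = 11, which is why the observed vector sits on the boundary.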



Firth’s Penalized Likelihood Method

•  The MLE is the root obtained when the score function (derivative of the loglikelihood) is equated to zero.

•  Firth’s method removes the $O(n^{-1})$ term from the bias of the MLE by modifying the score function with a penalty term.

•  The root obtained when this modified score function is set to zero is Firth’s Penalized Likelihood Estimate (FirthE).



Logistic Regression

•  The loglikelihood has the form $l(\beta) = t\beta - K(\beta)$, where $t$ is the observed sufficient statistic vector

•  The score function is therefore $U(\beta) = l'(\beta) = t - K'(\beta)$

•  Firth’s modified score function is

   $U_j^*(\beta) = U_j(\beta) + \tfrac{1}{2}\,\operatorname{trace}\left[ I(\beta)^{-1}\, \frac{\partial I(\beta)}{\partial \beta_j} \right]$

   where $I(\beta)$ is Fisher’s information matrix

•  Firth’s modification shrinks the ML estimate towards zero
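For the two-parameter logistic model, the modified score takes the known closed form $U_j^*(\beta) = \sum_i [y_i - \pi_i + h_i(\tfrac{1}{2} - \pi_i)]\,x_{ij}$, where $h_i$ is the i-th diagonal element of the weighted hat matrix. The sketch below solves $U^*(\beta) = 0$ for the separated Example 2 data. It is an illustration under assumptions, not the LogXact implementation: the Fisher-scoring iteration with step halving, the starting value (0, 0), the tolerances and all function names are choices made here.

```python
import math


def _sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def _parts(x, b0, b1):
    """Fitted probabilities, weights, entries of I = X'WX and det(I)."""
    pi = [_sigmoid(b0 + b1 * xi) for xi in x]
    w = [p * (1.0 - p) for p in pi]
    i00 = sum(w)
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return pi, w, i00, i01, i11, i00 * i11 - i01 * i01


def modified_score(x, y, b0, b1):
    """Firth's U*_j = sum_i [y_i - pi_i + h_i (1/2 - pi_i)] x_ij for j = 0, 1."""
    pi, w, i00, i01, i11, det = _parts(x, b0, b1)
    u0 = u1 = 0.0
    for xi, yi, p, wi in zip(x, y, pi, w):
        h = wi * (i11 - 2.0 * i01 * xi + i00 * xi * xi) / det  # hat-matrix diagonal
        r = yi - p + h * (0.5 - p)
        u0 += r
        u1 += r * xi
    return u0, u1


def penalized_loglik(x, y, b0, b1):
    """l(beta) + (1/2) log det I(beta); -inf if I is numerically singular."""
    det = _parts(x, b0, b1)[5]
    if det <= 0.0:
        return -math.inf
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        ll += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return ll + 0.5 * math.log(det)


def firth_fit(x, y, max_iter=500, tol=1e-8):
    """Solve U*(beta) = 0 by Fisher scoring with step halving (assumed scheme)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        _, _, i00, i01, i11, det = _parts(x, b0, b1)
        u0, u1 = modified_score(x, y, b0, b1)
        d0 = (i11 * u0 - i01 * u1) / det  # step = I^{-1} U*
        d1 = (i00 * u1 - i01 * u0) / det
        current = penalized_loglik(x, y, b0, b1)
        step = 1.0
        # Halve the step until the penalized loglikelihood does not decrease.
        while step > 1e-6 and penalized_loglik(
            x, y, b0 + step * d0, b1 + step * d1
        ) < current:
            step *= 0.5
        b0 += step * d0
        b1 += step * d1
        if max(abs(step * d0), abs(step * d1)) < tol:
            break
    return b0, b1


# Example 2 data: complete separation, so the ordinary MLE does not exist.
x = list(range(5, 101, 5))
y = [0] * 9 + [1] * 11
beta0, beta1 = firth_fit(x, y)
print(beta0, beta1, -beta0 / beta1)  # intercept, slope, implied ED50
```

The penalty term $\tfrac{1}{2}\log\det I(\beta)$ tends to minus infinity as the slope grows, which is what keeps the penalized estimate finite on data where the unpenalized likelihood has no maximum.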



Boundary points of Sufficient Statistics space

[Figure: Boundary Points in space of sufficient stats; horizontal axis t0 (0 to 25), vertical axis t1 (0 to 1200)]

t_0   lowest t_1   highest t_1
  0        0            0
  1        5          100
  2       15          195
  3       30          285
  4       50          370
  5       75          450
  6      105          525
  7      140          595
  8      180          660
  9      225          720
 10      275          775
 11      330          825
 12      390          870
 13      455          910
 14      525          945
 15      600          975
 16      680         1000
 17      765         1020
 18      855         1035
 19      950         1045
 20     1050         1050

(For t_0 = 0 and t_0 = 20 the lowest and highest values coincide, so each contributes a single point.)

There are 40 points on the boundary of the set of possible values of (t0 , t1)
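The boundary table can be regenerated directly from the covariate values. This is a minimal sketch with a hypothetical helper name, `boundary_points`; it uses the fact that for each t0 = k the extreme t1 values are the sums of the k smallest and the k largest x values.

```python
def boundary_points(xs):
    """Boundary (t0, t1) pairs: for each k, the sums of the k smallest and k largest x."""
    ordered = sorted(xs)
    n = len(ordered)
    points = set()
    for k in range(n + 1):
        points.add((k, sum(ordered[:k])))        # lowest t1 for t0 = k
        points.add((k, sum(ordered[n - k:])))    # highest t1 for t0 = k
    return sorted(points)


xs = list(range(5, 101, 5))
boundary = boundary_points(xs)
print(len(boundary))
```

Using a set collapses the coinciding lowest and highest values at t0 = 0 and t0 = 20, which is why the count is 40 rather than 42.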


Comparison of MUE with FirthE when MLE does not exist

•  We ran several numerical experiments with one-covariate models and a limited number with two-covariate models.

•  We used exhaustive enumeration of t-vectors as well as Monte Carlo simulations with sample sizes of 1000.

•  We will illustrate with the Example 2 data.
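One way to carry out an exhaustive enumeration of t-vectors is to build the exact distribution of (T0, T1) one observation at a time, which avoids listing all 2^20 response vectors. The sketch below is an assumption about how such an enumeration could be organised, not the authors' code. ED50 = 52.5 comes from the slides, but the slope 0.1 is a hypothetical value chosen for illustration, and the names are illustrative.

```python
import math
from collections import defaultdict


def t_distribution(xs, beta0, beta1):
    """Exact distribution of (T0, T1) = (sum y_i, sum x_i y_i) under the model."""
    dist = {(0, 0): 1.0}
    for x in xs:
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        nxt = defaultdict(float)
        for (t0, t1), prob in dist.items():
            nxt[(t0, t1)] += prob * (1.0 - p)    # this observation has y = 0
            nxt[(t0 + 1, t1 + x)] += prob * p    # this observation has y = 1
        dist = dict(nxt)
    return dist


xs = list(range(5, 101, 5))
slope = 0.1                      # hypothetical slope, not taken from the slides
intercept = -slope * 52.5        # places ED50 = -beta0 / beta1 at 52.5
dist = t_distribution(xs, intercept, slope)

# Probability that the observed t-vector falls on the boundary (MLE does not exist).
ordered = sorted(xs)
n = len(ordered)
boundary = {(k, sum(ordered[:k])) for k in range(n + 1)}
boundary |= {(k, sum(ordered[n - k:])) for k in range(n + 1)}
p_boundary = sum(prob for t, prob in dist.items() if t in boundary)
print(len(dist), p_boundary)
```

Given such a table, the exact bias or mean square error of any estimator that is a function of (t0, t1) is a probability-weighted sum over the support points where the estimator is defined.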


[Figure: Bias Comparison for MUE with FirthE for ED50 = 52.5]

Based on complete enumeration

[Figure: MSE Comparison for MUE with FirthE for ED50 = 52.5]



Findings from numerical experiments

•  Our numerical experiments with one covariate, and some with two covariates, suggest that Firth’s method gives better estimates than MUE in terms of both bias and Mean Square Error when there is complete separation.

•  Additional advantages of Firth’s method are:
   –  Unlike MUE, it does not depend on the conditional distribution of the sufficient statistic, so it does not have the problems associated with having few support points (e.g. with continuous covariates).
   –  It is much faster to compute.



A real dataset

•  Two hundred rats were treated with a toxic substance at four dose levels; the binary response examined was development of an intestinal tumor. The covariates were the dose levels (as factor variables) and a binary survival variable to control for death. (Data from US Toxicology Program Tech Report 405, 1991; the LogXact manual gives details.)

•  There was separation in this dataset.
•  The output is from the current beta version of LogXact, which provides Firth’s method as an option.



LogXact Results

              Point Estimate                   95% Conf. Interval               2*1-sided
Model Term    Type      Beta       SE(Beta)    Type          Lower      Upper   P-Value
%Const        FirthE    -3.861     2.108       Asymptotic    -7.993     0.2713  0.0671
dose_0        FirthE    -2.873     1.937       Asymptotic    -6.67      0.9241  0.1381
              MUE       -1.053     NA          Exact         -INF       1.909   0.4824
dose_150      FirthE    -1.24      1.438       Asymptotic    -4.057     1.578   0.3886
              CMLE      -1.444     1.667       Exact         -6.437     2.471   0.9367
dose_300      FirthE    -2.733     1.656       Asymptotic    -5.978     0.5116  0.0988
              MUE       -1.677     NA          Exact         -INF       0.869   0.2068
survival      FirthE    0.09387    0.1402      Asymptotic    -0.1808    0.3686  0.5030
              CMLE      0.1246     0.174       Exact         -0.2128    0.5058  0.5345



Near Separation

•  MLE is unstable: a small shift in the data leads to a huge change in the ML estimates of the coefficients

Example 1 (x2 for observation 18 set to k)

seq#   x1   x2   y
   1   10   10   1
   2   11   11   1
   3   12   12   1
   4   13   13   1
   5   14   14   1
   6   15   15   1
   7   16   16   1
   8   17   17   1
   9   19   19   1
  10   10   16   0
  11   11   17   0
  12   12   18   0
  13   13   19   0
  14   14   20   0
  15   15   21   0
  16   16   22   0
  17   17   23   0
  18   18    k   0
  19   18   24   0
  20   19   25   0

[Figure: covariate plot of data; horizontal axis x1 (5 to 20), vertical axis x2 (5 to 30), with the point at x2 = k marked]


MLE and Near separation: Example 1 (contd.)

[Figure: coefficients vs k; MLE of beta1 and beta2 (vertical axis beta, -3 to 2) plotted against k (horizontal axis, 0 to 20)]

Interior Points grouped into Layers by closeness to the boundary

[Figure: Interior Point Layers; horizontal axis t0 (0 to 20), vertical axis t1 (0 to 1200); legend: Layer 1, Layer 5, Layer 10, Layer 20, Layer 40, Layer 50]



[Figure: Bias Comparison of MLE to FirthE, ED50 = 52.5]

[Figure: Bias Comparison of MLE to FirthE, ED50 = 5]

[Figure: Bias Comparison of MLE to FirthE, ED50 = 100]

[Figure: Significant Models (pval < 0.05), Bias Comparison of MLE to FirthE, ED50 = 52.5]

Based on complete enumeration

[Figure: MSE Comparison of MLE to FirthE, ED50 = 52.5]

[Figure: MSE Comparison of MLE to FirthE, ED50 = 5]

[Figure: MSE Comparison of MLE to FirthE, ED50 = 100]

[Figure: Significant Models (pval < 0.05), MSE Comparison of MLE to FirthE, ED50 = 52.5]




Conclusions from Experiments

•  Our numerical experiments and simulations suggest that FirthE reduces bias as well as Mean Square Error in comparison to MLE when the maximum slope of the logistic curve is not very high.

•  However, when the maximum slope is high, the FirthE correction for bias produces excessive shrinkage and the MLE is superior.

•  In many data sets that arise in practice we don’t expect large changes in response for small changes in the covariate values, so FirthE will be superior.

•  We conjecture that this conclusion will also hold when we compare conditional MLE and conditional FirthE.


Detecting near separation in data sets

•  We have a research project to create an index that signals near separation in data sets, to alert LogXact users to the bias in the MLE.

•  Please let us know if you have datasets you can share that seem to exhibit near separation.

•  Experiments suggest that we can use Confidence Intervals based on the Firth Profile Likelihood to detect near separation.

•  The ratio of the Upper CI width to the Lower CI width appears to have promise as an index of near separation.
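As a small illustration of the proposed index, the ratio can be computed from a point estimate and its interval limits. This sketch assumes that the "upper CI width" means upper limit minus point estimate and the "lower CI width" means point estimate minus lower limit; the slides do not spell this out. The function name and the numbers in the example are hypothetical.

```python
def ci_width_ratio(estimate, lower, upper):
    """Upper CI width divided by lower CI width (widths measured from the estimate)."""
    upper_width = upper - estimate
    lower_width = estimate - lower
    if upper_width <= 0 or lower_width <= 0:
        raise ValueError("expected lower < estimate < upper")
    return upper_width / lower_width


# Hypothetical values: an interval that stretches three times further upward.
ratio = ci_width_ratio(estimate=1.0, lower=0.5, upper=2.5)
print(ratio)
```

Under this reading, a symmetric interval gives a ratio of 1, and larger ratios indicate an interval that is stretched on the upper side.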





Ratios were calculated for each interior point

[Figure: Ratio of Firth Profile Likelihood 95% CI widths, Ratio = UCIwidth/LCIwidth; horizontal axis: # layers from boundary (0 to 60), vertical axis: Ratio (0 to 4.5); with fitted polynomial]

Thank you! [email protected]


