A New CPXR-Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury

Vahid Taslimitehrani, Guozhu Dong
Kno.e.sis Center, Department of Computer Science and Engineering
Wright State University, Dayton, OH

Ohio Center of Excellence in Knowledge-Enabled Computing


Outline

• Motivation and background
• Preliminaries
  – Contrast pattern mining
  – Logistic regression
• CPXR(Log)
• TBI data
• Results of CPXR(Log) on TBI
• Conclusion
• References


Motivation and Background

• CPXR(Log): accurate and informative prognostic models
  – Prognostic models are central to medicine [Steyerberg, 2009].
  – They facilitate physicians' decision making on patient treatment plans, screening, etc.
  – They help in understanding disease behavior, including identifying new biomarkers.
  – The number of articles listed in PubMed with “prediction model” in the title in 2012 was 7 times that in 2000 [pubmed].


Motivation and Background

• CPXR(Log): a powerful new generic logistic regression method
  – Logistic regression is one of the most popular approaches for building clinical prediction models [Steyerberg, 2009].
  – Logistic regression models are desirable because
    · they are representable,
    · they are probability based,
    · they are flexible in terms of predictor variables (categorical and numerical variables).


Motivation and Background

• Traumatic Brain Injury
  – One of the leading causes of death and disability worldwide.
  – 1.5 million deaths annually worldwide [Perel, 2006].
  – $76.5 billion in direct and indirect costs in the US in 2010 [www.cdc.gov].
  – Physicians need early and accurate prognostic models, based only on admission-time data, to make time-critical clinical decisions.


Challenges in clinical modeling

• Accuracy of clinical prediction models
• Ease of interpreting clinical prediction models
  – To explain medical decisions to the patient
  – To identify important risk factors
• Avoiding overfitting, to make clinical prediction models more generalizable
• Early decision making
• Ability to capture
  – Heterogeneous patient group behavior


CPXR works well by using several (pattern, local model) pairs

Different subpopulations need different prediction models. Using just one prediction function does not work well.

This is not an extreme case; it happens very often.
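The point about subpopulations can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration: the data are generated, the binary flag `g` stands in for a one-item pattern, and plain least-squares lines are used instead of logistic models. It is not the TBI data or the CPXR algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: predictor x in [0, 1] and a binary flag g that plays the
# role of a one-item pattern splitting the data into two subpopulations.
n = 200
x = rng.uniform(0.0, 1.0, size=n)
g = rng.integers(0, 2, size=n)
# The two subpopulations follow opposite trends (plus small noise).
y = np.where(g == 1, 2.0 * x, 1.0 - 2.0 * x) + rng.normal(0.0, 0.05, size=n)

def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns the pair (a, b)."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# One global prediction function for all instances.
a, b = fit_line(x, y)
global_rmse = rmse(y, a * x + b)

# One local model per subpopulation.
pred = np.empty(n)
for flag in (0, 1):
    mask = g == flag
    a_loc, b_loc = fit_line(x[mask], y[mask])
    pred[mask] = a_loc * x[mask] + b_loc
local_rmse = rmse(y, pred)

print(f"global RMSE = {global_rmse:.3f}, per-group RMSE = {local_rmse:.3f}")
```

On this toy data the single global line averages the two opposite trends away, while the two local lines recover them.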


How is CPXR(Log) different from other classifiers?

• CPXR introduced the idea of
  – using patterns to logically characterize different subpopulations of the data,
  – using local regression models to represent the predictor-response relationship of each subpopulation, and
  – choosing a pattern only if its local model is very accurate [Dong, 2014].
• CPXR(Log)
  – can capture diversified/heterogeneous behavior,
  – is more generalizable,
  – overfits less than other classifiers.
• CPXR(Log) is more accurate than other classifiers such as SVM and Random Forest.


Traditional classification vs CPXR

Traditional classification: Training data → Classification engine → Classifier (model)

CPXR: Training data → Classification engine → Baseline model → Large error data and small error data → Build and select CPs & local models → (Pattern 1, Model 1), (Pattern 2, Model 2), …, (Pattern k, Model k)


CPXR(Log) – PXR concept

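• Definition: Let $D$ be training data for regression. Let $f$ be a regression model built on $D$, which we will call the baseline model on $D$. A pattern aided regression (PXR) model is a tuple $PM = ((P_1, f_1, w_1), \ldots, (P_k, f_k, w_k), f_d)$, where $PS = \{P_1, \ldots, P_k\}$ is the pattern set of $PM$, the $f_i$s are local regression models of the $P_i$s (with weights $w_i > 0$) and $f_d$ is the default regression model. We define the regression model of $PM$ as

$$f_{PM}(x) = \begin{cases} \dfrac{\sum_{P_i \in \pi_x} w_i f_i(x)}{\sum_{P_i \in \pi_x} w_i} & \text{if } \pi_x \neq \emptyset \\ f_d(x) & \text{otherwise} \end{cases}$$

for each instance $x$, where $\pi_x = \{P_i \mid 1 \le i \le k,\ x \text{ matches } P_i\}$.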
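A minimal sketch of how a PXR model could produce a prediction, assuming the weighted-average form described in [Dong, 2014]: each pattern carries a local model and a positive weight, and the default model is used when no pattern matches. The names `pxr_predict`, `triples`, and `default_model`, and the example variables and values, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, float]
Pattern = Callable[[Instance], bool]   # True if the instance matches the pattern
Model = Callable[[Instance], float]    # predicted value for the instance

def pxr_predict(
    x: Instance,
    triples: List[Tuple[Pattern, Model, float]],
    default_model: Model,
) -> float:
    """Weighted average of the local models whose patterns match x.
    Falls back to the default model when no pattern matches."""
    matched = [(f, w) for (p, f, w) in triples if p(x)]
    if not matched:
        return default_model(x)
    total_weight = sum(w for _, w in matched)
    return sum(w * f(x) for f, w in matched) / total_weight

# Hypothetical two-pattern model on made-up variables and constant local models.
triples = [
    (lambda x: x["age"] > 60, lambda x: 0.8, 2.0),
    (lambda x: x["glucose"] <= 10.0, lambda x: 0.2, 1.0),
]
default = lambda x: 0.5

both = pxr_predict({"age": 70, "glucose": 8.0}, triples, default)   # both patterns match
none = pxr_predict({"age": 30, "glucose": 12.0}, triples, default)  # no pattern matches
print(both, none)
```

When both patterns match, the result is the weight-averaged local prediction, (2.0 × 0.8 + 1.0 × 0.2) / 3.0; when none matches, the default model answers.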


Preliminaries: Contrast Patterns

• A toy example

Item sets of the five transactions (TID / Class table):
  b d e g i
  b c e g i
  a c e g j
  a c e h j
  b d f g i

• Given a support ratio threshold such as 2, the example pattern is a contrast pattern.
• Details: We only consider one minimal generator pattern for each equivalence class of contrast patterns.


Quality measures

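• CPXR(Log) needs to efficiently extract a desirable pattern set from a huge search space of potential pattern sets.

• Definition: The average residual reduction (arr) of a pattern $P$ w.r.t. a model $f$ and a dataset $D$ is

$$arr(P) = \frac{\sum_{x \in mds(P)} \left( |r_x(f)| - |r_x(f_P)| \right)}{|mds(P)|}$$

where $r_x(f)$ is the residual of $f$ on instance $x$, $mds(P)$ is the matching data of $P$ in $D$, and $f_P$ is the local model of $P$.

• Definition: The total residual reduction (trr) of a pattern set $PS = \{P_1, \ldots, P_k\}$ w.r.t. a model $f$ and a dataset $D$ is

$$trr(PS) = \frac{\sum_{x \in mds(PS)} \left( |r_x(f)| - |r_x(f_{PM})| \right)}{\sum_{x \in D} |r_x(f)|}$$

where $PM = ((P_1, f_1, w_1), \ldots, (P_k, f_k, w_k), f)$, $w_i = arr(P_i)$, and $mds(PS) = \bigcup_{P \in PS} mds(P)$.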
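The two quality measures can be sketched in a few lines, assuming residuals are compared by absolute value, that arr averages the reduction over the pattern's matching data, and that trr normalizes the reduction by the baseline's total residual over the dataset. The function signatures and the residual values below are made up for illustration.

```python
import numpy as np

def arr(base_resid, local_resid, match_mask):
    """Average residual reduction of one pattern.
    base_resid / local_resid: residuals of the baseline model and of the
    pattern's local model on every instance; match_mask: True where the
    pattern matches."""
    m = np.asarray(match_mask, dtype=bool)
    gain = np.abs(np.asarray(base_resid)[m]) - np.abs(np.asarray(local_resid)[m])
    return float(gain.sum() / m.sum())

def trr(base_resid, pxr_resid, union_mask):
    """Total residual reduction of a pattern set.
    pxr_resid: residuals of the combined PXR model; union_mask: True where at
    least one pattern of the set matches."""
    m = np.asarray(union_mask, dtype=bool)
    base = np.abs(np.asarray(base_resid))
    gain = base[m] - np.abs(np.asarray(pxr_resid)[m])
    return float(gain.sum() / base.sum())

# Made-up residuals for five instances; the pattern matches the first three.
base = np.array([0.8, 0.6, 0.7, 0.1, 0.2])
local = np.array([0.2, 0.1, 0.3, 0.1, 0.2])
mask = np.array([True, True, True, False, False])

print(arr(base, local, mask))   # (0.6 + 0.5 + 0.4) / 3
print(trr(base, local, mask))   # (0.6 + 0.5 + 0.4) / 2.4
```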


CPXR(Log) algorithm -- outline

• First step: split the training dataset $D$ into two classes, $LE$ and $SE$.
  – $LE$: instances of $D$ where the baseline model $f$ makes a large error.
  – $SE$: instances of $D$ where the baseline model $f$ makes a small error.
• Second step: extract all contrast patterns of $LE$ satisfying the $minSup$ threshold.
• Third step: search for a small set of patterns that maximizes error reduction, and use that set to build a model.

• Note
  – Each pattern $P$ is associated with a local regression model built on $P$'s matching data.
  – Using a pattern and its associated local regression model is a flexible way to represent one predictor-response relationship.
  – Different pairs represent highly different predictor-response relationships.


CPXR(Log) – details (1)

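• Inputs:
  – Training data $D$
  – Baseline model $f$
  – Ratio $\rho$ to partition $D$ into $LE$ and $SE$
  – $minSup$ threshold on contrast patterns
• Output:
  – A PXR model $PM$

Let $r_x(f)$ denote $f$'s error on $x$;
Determine $\kappa$ to minimize $\left| \rho - \frac{\sum_{|r_x(f)| > \kappa} |r_x(f)|}{\sum_{x \in D} |r_x(f)|} \right|$;
Let $LE = \{x \in D \mid |r_x(f)| > \kappa\}$ and $SE = D - LE$;
Discretize each numerical variable using entropy-based binning;
Extract the set $CPS$ of all contrast patterns for $minSup$ in the class $LE$;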
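The partition step can be sketched as follows, under the assumption that the cutoff is chosen so that the large-error instances carry a share of the total absolute error as close as possible to the input ratio. The function name `split_by_error`, the exhaustive search over observed residual values, and the residuals themselves are illustrative choices, not taken from the slides.

```python
import numpy as np

def split_by_error(resid, rho):
    """Pick a cutoff so that the instances with |residual| > cutoff carry a
    share of the total absolute residual as close as possible to rho.
    Returns (cutoff, le_mask); the small-error class is the complement."""
    r = np.abs(np.asarray(resid, dtype=float))
    total = r.sum()
    best_cut, best_gap = None, np.inf
    for cut in np.unique(r):                  # candidate cutoffs: observed values
        share = r[r > cut].sum() / total      # error share of the would-be LE
        gap = abs(rho - share)
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut, r > best_cut

# Made-up baseline residuals for six instances.
resid = [0.05, 0.10, 0.10, 0.40, 0.60, 0.75]
cutoff, le = split_by_error(resid, rho=0.7)
print(cutoff, le.tolist())
```

With these residuals the two largest errors (0.60 and 0.75) hold 67.5% of the total error, the closest achievable share to 0.7, so they form the large-error class.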


CPXR(Log) – details (2)

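For each pattern $P$ in the extracted contrast pattern set $CPS$, build the local regression model $f_P$ for data in $mds(P)$;
Let $PS = \{P\}$, where $P$ is the pattern in $CPS$ with the highest $arr$;
Let $f_d$ be the regression model trained from $D - mds(PS)$;
Return $PM$;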


TBI data

• The TBI dataset is a collection of international and US Tirilazad trials.
• 2159 instances [Steyerberg, 2008].
• 15 numerical and categorical predictor variables.
• Missing values were treated using multiple imputation.
• The outcome variable is the Glasgow Outcome Scale: GOS 1 (dead), …, GOS 5 (good recovery).
• This study used two discretized versions of GOS: “Mortality” vs survival (GOS 1 vs GOS 2-5) and “Unfavorable” vs favorable (GOS 1-3 vs GOS 4-5).

| Category | Predictor variables |
|---|---|
| Basic | Cause of injury, age, GCS motor score, pupil reactivity |
| Computed tomography (CT) | Hypoxia, hypotension, Marshall CT, tSAH, EDH, compressed cistern, midline shift more than 5 mm |
| Lab | Glucose, pH, sodium, Hb |
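The two binary outcomes can be derived from the five-point GOS as in the sketch below. The function name `dichotomize_gos` and the 0/1 encoding are assumptions; only the cut points (GOS 1 vs 2-5, GOS 1-3 vs 4-5) come from the slide.

```python
def dichotomize_gos(gos: int) -> dict:
    """Map a Glasgow Outcome Scale value (1..5) to the two binary outcomes.
    mortality:   1 for GOS 1, 0 for GOS 2-5
    unfavorable: 1 for GOS 1-3, 0 for GOS 4-5"""
    if gos not in (1, 2, 3, 4, 5):
        raise ValueError(f"GOS must be an integer from 1 to 5, got {gos!r}")
    return {"mortality": int(gos == 1), "unfavorable": int(gos <= 3)}

for score in range(1, 6):
    print(score, dichotomize_gos(score))
```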


Results – Performance of SLogR and CPXR(Log) on Mortality models

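| Model | SLogR Specificity | SLogR Sensitivity | SLogR F1 | SLogR AUC | CPXR(Log) Specificity | CPXR(Log) Sensitivity | CPXR(Log) F1 | CPXR(Log) AUC |
|---|---|---|---|---|---|---|---|---|
| Basic | 0.95 | 0.18 | 0.27 | 0.77 | 0.96 | 0.18 | 0.28 | 0.8 |
| Basic+CT | 0.95 | 0.32 | 0.42 | 0.8 | 0.96 | 0.42 | 0.53 | 0.88 |
| Basic+CT+Lab | 0.94 | 0.36 | 0.46 | 0.8 | 0.97 | 0.46 | 0.58 | 0.92 |

CPXR(Log) is more accurate than standard logistic regression (SLogR).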


Results – Performance of SLogR and CPXR(Log) on Unfavorable models

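| Model | SLogR Specificity | SLogR Sensitivity | SLogR F1 | SLogR AUC | CPXR(Log) Specificity | CPXR(Log) Sensitivity | CPXR(Log) F1 | CPXR(Log) AUC |
|---|---|---|---|---|---|---|---|---|
| Basic | 0.85 | 0.52 | 0.59 | 0.76 | 0.89 | 0.54 | 0.63 | 0.82 |
| Basic+CT | 0.85 | 0.6 | 0.66 | 0.8 | 0.87 | 0.65 | 0.7 | 0.87 |
| Basic+CT+Lab | 0.84 | 0.61 | 0.66 | 0.81 | 0.91 | 0.72 | 0.76 | 0.93 |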


Results – Impact of adding more variables on AUC

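AUC improvement when more variables are used by CPXR(Log) and SLogR

| Variable set change | Mortality CPXR(Log) | Mortality SLogR | Unfavorable CPXR(Log) | Unfavorable SLogR |
|---|---|---|---|---|
| Basic → Basic+CT | 10% | 7.7% | 6% | 5.2% |
| Basic → Basic+CT+Lab | 15% | 11.1% | 13.4% | 6.6% |

AUC improvement of CPXR(Log) over SLogR

| | Mortality Basic | Mortality Basic+CT | Mortality Basic+CT+Lab | Unfavorable Basic | Unfavorable Basic+CT | Unfavorable Basic+CT+Lab |
|---|---|---|---|---|---|---|
| CPXR(Log) over SLogR | 11.1% | 12.8% | 15% | 7.9% | 8.8% | 14.8% |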
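The percentages appear to be relative AUC changes. A small sketch, assuming that reading, recomputes the CPXR(Log)-over-SLogR gains for the Unfavorable models from the AUC values reported in the Unfavorable results table; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old

# AUC values of the Unfavorable models as reported on the results slide.
unfavorable_auc = {
    "Basic":        {"SLogR": 0.76, "CPXR(Log)": 0.82},
    "Basic+CT":     {"SLogR": 0.80, "CPXR(Log)": 0.87},
    "Basic+CT+Lab": {"SLogR": 0.81, "CPXR(Log)": 0.93},
}

for name, auc in unfavorable_auc.items():
    gain = relative_improvement(auc["CPXR(Log)"], auc["SLogR"])
    print(f"{name}: {gain:.1f}%")
```

Under this assumption the recomputed gains agree with the reported Unfavorable figures up to rounding of the published AUC values.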


Results – ROC curves of Basic models


Results - ROC curves of (Basic + CT) models


Results - ROC curves of (Basic+CT+Lab) models


Results – Performance comparison

Comparing CPXR(Log) performance with
- Logistic Regression
- SVM
- Random Forest


Example: patterns used by CPXR(Log) & Mortality (Basic+CT+Lab)

| ID | Pattern | arr | Cov |
|---|---|---|---|
| I | (CT classification = III) | 15% | 20% |
| II | (CT classification = V) AND (midline shift) AND (0.56 < glucose <= 10.4) | 12% | 15% |
| III | (No compressed cistern) AND (No midline shift) AND (7.22 < PH <= 7.45) | 10% | 40% |
| IV | (10.77 < glucose <= 21.98) AND (134 < sodium <= 144) | 18% | 18% |
| V | (No Hypotension) AND (134 < sodium < 144) AND (10.55 < HB <= 14.57) AND (with tSAH) | 19% | 20% |
| VI | (No tSAH) AND (134 < sodium <= 144) AND (10.77 < glucose <= 21.98) AND (No Hypotension) AND (No midline shift) AND (One reactive pupil) | 19% | 20% |
| VII | (No tSAH) AND (One reactive pupil) | 18% | 40% |


Odds ratios

(CT classification = V) AND (midline shift) AND (0.56 < glucose <= 10.4)


Residual reduction and example patient

Example patient:
• Age = 15 years
• Cause of injury = motorbike accident
• GCS motor score = 5 (No eye response)
• No reactive pupil
• No hypoxia
• No hypotension
• CT scan classification = V (mass lesion)
• No tSAH
• With EDH
• Has midline shift more than 5 mm
• Glucose = 9.06 mmol/l
• pH = 7.37
• Sodium = 141 mmol/l
• Hb = 14.4 g/dl
• Outcome: the patient died.

Standard logistic regression predicted a survival probability of 0.78 for this patient.

The patient matches “pattern II”, and CPXR(Log) predicted a survival probability of 0.38.

[Figure: Error distribution of TBI dataset on SLogR]

[Figure: Error distribution of TBI dataset on CPXR(Log)]
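Checking whether a patient matches a pattern amounts to testing every item of the pattern. The sketch below encodes the example patient and two of the mortality patterns from the deck, the one with CT classification V, midline shift and glucose in (0.56, 10.4], and the one on glucose and sodium. The dictionary keys and the lambda encoding of items are hypothetical.

```python
# Example patient from the slide; key names are illustrative.
patient = {
    "ct_class": "V",
    "midline_shift": True,
    "glucose": 9.06,   # mmol/l
    "ph": 7.37,
    "sodium": 141,     # mmol/l
    "hb": 14.4,        # g/dl
}

# Each item is a single-variable condition; a pattern is a list of items.
pattern_ct_v = [
    lambda p: p["ct_class"] == "V",
    lambda p: p["midline_shift"],
    lambda p: 0.56 < p["glucose"] <= 10.4,
]
pattern_glucose_sodium = [
    lambda p: 10.77 < p["glucose"] <= 21.98,
    lambda p: 134 < p["sodium"] <= 144,
]

def matches(instance, pattern):
    """An instance matches a pattern if it satisfies every item."""
    return all(item(instance) for item in pattern)

print(matches(patient, pattern_ct_v))            # all three items hold
print(matches(patient, pattern_glucose_sodium))  # glucose item fails
```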


Results – Box plot of RMSE reduction in CPXR

How much CPXR can reduce RMSE (root mean square error) on 50 datasets compared to
• Piecewise linear regression
• Support vector regression
• Bayesian additive regression trees
• Gradient boosting method


Results – Noise sensitivity and impact of the number of patterns

The number of patterns is determined automatically by the method.

How much does noise in the datasets affect the performance of CPXR and other methods?


Conclusion

• We presented an effective new method, CPXR(Log), for logistic regression and for clinical predictive modeling.

• We showed that CPXR(Log) is more accurate than standard logistic regression and some other classification algorithms.

• We also presented CPXR(Log) models, including patterns and local models, and new odds ratios of predictor variables.


References

• G. Dong and V. Taslimitehrani: Pattern-aided regression modeling and prediction model analysis. Tech Report, CSE, Wright State Univ., 2014.

• E. Steyerberg: Clinical prediction models. Springer, 2009.

• P. Perel, P. Edwards, R. Wentz, and I. Roberts: Systematic review of prognostic models in traumatic brain injury. BMC Medical Informatics and Decision Making, 6(1): 1-10, 2006.

• G. Dong and J. Li: Efficient mining of emerging patterns: Discovering trends and differences. In Proc. KDD, 43-52, 1999.

• E.W. Steyerberg, et al.: Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics. PLoS Medicine, 5(8): e165, 2008.


Preliminaries: Logistic Regression

• Regression modeling: predicting a response variable (output) based on predictor variables (input).

• Logistic regression: the response variable is binary. For example,
  – “having the disease” or “not”
  – “mortal” or “not”

• Let $X = (x_1, \ldots, x_n)$ be a vector of predictor variables and $Y$ be the response variable.
• The goal of logistic regression is learning a function like

$$f(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}$$

satisfying $f(X) \approx P(Y = 1 \mid X)$.

• Chi-square ($\chi^2$) is one of the goodness-of-fit measures for logistic regression.
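A logistic regression model passes a linear combination of the predictors through the logistic function, and the exponential of a coefficient is the odds ratio for a one-unit increase of that predictor. The sketch below shows both; the coefficient values are invented for illustration and are not fitted to any data.

```python
import math

def logistic(beta0, betas, x):
    """Predicted probability for predictor vector x under a logistic model
    with intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase of a predictor."""
    return math.exp(beta)

# Hypothetical coefficients (not fitted to any data).
beta0, betas = -1.0, [0.5, 2.0]

p = logistic(beta0, betas, [2.0, 0.0])   # linear part: -1.0 + 0.5*2.0 + 2.0*0.0 = 0
print(p, odds_ratio(betas[1]))
```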


Preliminaries: Contrast Patterns

• An item is a single-variable condition of the form “A = a” (categorical variable) or an interval condition such as “v1 < A <= v2” (numerical variable).
• A pattern is a finite set of items.
• An instance X from dataset D is said to match a pattern P if X satisfies every item in P.
• Example: an “Age” item AND “Diagnosed with high cholesterol = YES” is a pattern with two items. One instance (patient ID = 1) matches the above pattern.

| Patient ID | Age | BMI | Sys Blood Pressure | Diagnosed with high Cholesterol | Diagnosed with Heart Failure (C) |
|---|---|---|---|---|---|
| 1 | 75 | 22 | 120 | YES | YES |
| 2 | 67 | 27 | 131 | NO | NO |


Preliminaries: Contrast Patterns

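• The matching data of pattern P in dataset D, or $mds(P, D)$, is the set of all instances in D matching pattern P.

• The support of pattern P in D is $supp(P, D) = \frac{|mds(P, D)|}{|D|}$.
• Given 2 classes $C_1$ and $C_2$, the support ratio of pattern P from $C_1$ to $C_2$ is $suppRatio(P) = \frac{supp(P, C_2)}{supp(P, C_1)}$.

• Given a threshold $\gamma$, a contrast pattern (emerging pattern) of class $C_2$ is a pattern P satisfying $suppRatio(P) \ge \gamma$ [Dong, 1999].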
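Support and support ratio can be computed directly from these definitions. The sketch below uses the five item sets of the toy example; the assignment of transactions to the two classes is hypothetical, since the class column is not available here, and the function names are illustrative.

```python
def support(pattern, dataset):
    """Fraction of transactions in `dataset` that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(pattern <= set(t) for t in dataset) / len(dataset)

def support_ratio(pattern, from_class, to_class):
    """Support in `to_class` divided by support in `from_class`.
    Returns infinity when the pattern occurs only in `to_class`."""
    s_from = support(pattern, from_class)
    s_to = support(pattern, to_class)
    if s_from == 0:
        return float("inf") if s_to > 0 else 0.0
    return s_to / s_from

# Item sets from the toy example; this split into classes is an assumption.
c1 = [{"a", "c", "e", "g", "j"}, {"a", "c", "e", "h", "j"}]
c2 = [{"b", "d", "e", "g", "i"}, {"b", "c", "e", "g", "i"}, {"b", "d", "f", "g", "i"}]

threshold = 2
for pattern in ({"c", "e"}, {"e", "g"}):
    ratio = support_ratio(pattern, from_class=c2, to_class=c1)
    print(sorted(pattern), ratio, ratio >= threshold)
```

With this hypothetical split, {c, e} occurs in every transaction of the first class and in one of three transactions of the second, so its support ratio clears a threshold of 2, whereas {e, g} is more frequent in the second class and does not.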
