
Analysis Report Presentation 041515 - Team 4


Page 1: Analysis Report Presentation 041515 - Team 4

ANALYSIS ON STATISTICAL MODEL THAT BEST PREDICTS HEART DISEASE

Prepared for Healthy Living, Inc.

Prepared by Team 4, LLC

April 15, 2015

Page 2

Objective

• To provide a statistical model that predicts heart disease and can be implemented in a consumer-facing app

• To present our methodologies and model results

• To provide our recommendation on which model should be used

• To provide necessary R code to implement the model


Page 3

Overall Process


Page 4

Data Preparation and Variable Selection

• The raw data set (“heart disease data_hungarian.xlsx”) includes 76 variables and 294 observations

• The response variable for predicting heart disease is binary, with a value of 1 for positive (heart disease) and 0 for negative (no heart disease)

• We recoded a prior variable (originally called “num”), which included values 1 through 4, to a value of 1 in the new response variable

• We need to cut down the number of variables while still predicting heart disease efficiently
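The recoding step above can be sketched as follows. This is an illustrative Python sketch (the deck's deliverable code was written in R), and the function name is our own:

```python
def recode_response(num_values):
    """Recode the original "num" variable: 0 stays 0 (no heart disease);
    values 1 through 4 all become 1 (heart disease present)."""
    return [0 if v == 0 else 1 for v in num_values]

print(recode_response([0, 2, 4, 1, 0]))  # -> [0, 1, 1, 1, 0]
```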


Page 5

Process to Reduce Number of Variables

• Remove variables with no relevant information
• Thirty-two variables were removed because they were empty, irrelevant, lacked any information in the data dictionary, or represented the month, day, and year

• Remove variables with small variability
• Some variables had a single value for almost all observations

• Remove variables with multicollinearity
• Nine variables were removed for high multicollinearity, using a collinearity cutoff of 0.6

• Remove variables with low correlation with the outcome
• A minimum absolute correlation threshold of 10 percent removed four additional variables that were not related to the outcome
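The filtering passes above can be sketched roughly as follows. This is an illustrative Python sketch of the same logic (the actual deliverable was R code); the function name and the greedy pair-dropping order are our own assumptions, and the variability step here only drops strictly constant columns:

```python
import numpy as np

def reduce_variables(X, y, names, collin_cut=0.6, outcome_cut=0.10):
    """Filter the columns of X (n x p) in three passes: drop constant columns,
    drop one column of each highly collinear pair (|r| > collin_cut), then
    drop columns weakly correlated with the outcome (|r| < outcome_cut)."""
    keep = [j for j in range(X.shape[1]) if np.std(X[:, j]) > 0]
    dropped = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            i, j = keep[a], keep[b]
            if i in dropped or j in dropped:
                continue
            if abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) > collin_cut:
                dropped.add(j)                  # greedily keep the earlier column
    keep = [j for j in keep if j not in dropped]
    keep = [j for j in keep if abs(np.corrcoef(X[:, j], y)[0, 1]) >= outcome_cut]
    return [names[j] for j in keep]
```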


Page 6

Monte Carlo Data Imputation

• To maximize the power of our tests, we use a Monte Carlo technique for data imputation.

• We examine the distribution of each variable and generate random numbers from these distributions to fill in the missing data.

• Compared with filling in a single value such as the mean, this approach preserves each variable's variability and generally leads to less biased results.
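A minimal sketch of this style of imputation, assuming each missing entry is filled by a random draw from the column's observed (empirical) distribution. Written in Python for illustration (the deliverable was R code), with a fixed seed for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def monte_carlo_impute(column):
    """Fill missing values (NaN) by drawing random samples from the empirical
    distribution of the observed values in the same column."""
    column = np.asarray(column, dtype=float).copy()
    observed = column[~np.isnan(column)]
    missing = np.isnan(column)
    column[missing] = rng.choice(observed, size=int(missing.sum()))
    return column
```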


Page 7

Validation Set

We were originally provided separate training and testing sets. Upon investigating the distribution of the response variable, we found that the number of testing observations not classified as heart disease is very small:

                                              Training Data Set   Testing Data Set
Total number of observations                         294                123
Patients with heart disease (% of total)          106 (36%)          115 (93%)
Patients without heart disease (% of total)       188 (64%)            8 (7%)

Page 8

Validation Set

• The two data sets come from fundamentally different populations


Page 9

Proposed Models

• The following models were considered in the analysis:

• K-Nearest Neighbors

• Logistic Regression

• Linear and Quadratic Discriminant Analysis

• Decision Trees: Bagging

• Decision Trees: Random Forests

• Decision Trees: Boosting


Page 10

Analysis Results – K-Nearest Neighbors (KNN)

• K = 13 performed best

• The KNN method achieves the lowest false positive rate (6.6%) of all the methods considered.
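For illustration, a self-contained sketch of the KNN classification rule with K = 13 (majority vote over the 13 nearest training points under Euclidean distance). The actual analysis was done in R; the tie-breaking toward the positive class here is an arbitrary choice of ours:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=13):
    """Classify a new point by majority vote among its k nearest training
    points under Euclidean distance (K = 13 per the deck's tuning)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    positive_votes = int(y_train[nearest].sum())
    return 1 if 2 * positive_votes >= k else 0  # ties go to the positive class
```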


Page 11

Logistic Regression

• Our variable selection technique quickly isolates the variables that most affect the outcome: we chose the variables most correlated with the outcome.

• This was done by running a model with the single most correlated variable, then with the 2 most correlated variables, then with the top 3, and so on, comparing the accuracy of each model.

• We found that a model with the 10 most correlated variables has the best accuracy, at 84%.
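The ranking step behind this search can be sketched as follows; nested logistic models would then be fitted on the top 1, 2, ..., k variables and compared on accuracy. Python for illustration (the deliverable was R code), and the function name is ours:

```python
import numpy as np

def rank_by_outcome_correlation(X, y, names):
    """Order candidate variables by |correlation with the outcome|, highest
    first; nested models are then fitted on the top 1, top 2, ... variables."""
    strengths = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(strengths)[::-1]
    return [names[j] for j in order]
```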

Page 12

Decision Tree – Bagging

• The chart shows the mean decrease in Gini index and the mean decrease in accuracy for each variable, relative to the largest; by these measures, the most important variables are oldpeak and relrest.

• While the results give insight into the most impactful variables, bagging does not perform well compared to the other methods.

Page 13

Decision Tree – Random Forests

• The number of variables randomly sampled at each split was set to 4

• This is based on the commonly used heuristic for classification with random forests, mtry ≈ √p, where p is the number of features in the data set

• While random forests improved the prediction results compared to bagging, they ultimately did not beat the other models.
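The heuristic can be written out directly; with the deck's choice of 4 sampled variables this corresponds to roughly 16 candidate features, though the exact feature count is not stated. A Python sketch for illustration:

```python
import math

def default_mtry(p):
    """Common default for classification random forests: sample about the
    square root of p candidate variables at each split."""
    return max(1, math.floor(math.sqrt(p)))

print(default_mtry(16))  # -> 4
```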


Page 14

Decision Tree – Boosting

• Boosting performs best when using 14 trees, as shown by the graph

• Of all the decision tree methods, boosting achieved the best values, but they were not as good as those of the other models.
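For illustration, a toy boosting sketch in Python using AdaBoost with one-feature threshold stumps. The deck's model was presumably fitted with an R boosting package, so treat this only as a sketch of the general technique, not the team's implementation:

```python
import numpy as np

def adaboost_stumps(X, y, n_trees=14):
    """Fit a toy AdaBoost ensemble of one-feature threshold stumps
    (the deck found that 14 trees worked best for its boosting model)."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                      # observation weights
    ys = np.where(y == 1, 1.0, -1.0)             # labels in {-1, +1}
    stumps = []
    for _ in range(n_trees):
        best = None                              # (error, feature, threshold, sign)
        for j in range(p):
            for t in np.unique(X[:, j]):
                for sign in (1.0, -1.0):
                    pred = np.where(X[:, j] >= t, sign, -sign)
                    err = w[pred != ys].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        err, j, t, sign = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * np.log((1.0 - err) / err)  # stump weight
        pred = np.where(X[:, j] >= t, sign, -sign)
        w = w * np.exp(-alpha * ys * pred)       # upweight misclassified points
        w = w / w.sum()
        stumps.append((alpha, j, t, sign))
    return stumps

def boost_predict(stumps, X):
    """Predict 0/1 labels from the weighted vote of the stumps."""
    score = sum(a * np.where(X[:, j] >= t, s, -s) for a, j, t, s in stumps)
    return (score > 0).astype(int)
```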


Page 15

Linear Discriminant Analysis and Quadratic Discriminant Analysis

• The LDA method had decent results, but did not beat logistic regression on any of the relevant metrics (accuracy, sensitivity, false positive rate)

• QDA achieved the highest sensitivity of any method, almost 77%, but at the cost of a higher false positive rate.


Page 16

Model Summary

Model                              Accuracy   Sensitivity   False Positive Rate
K-Nearest Neighbors                 80.95%      60.71%            6.59%
Logistic Regression                 84.35%      73.21%            8.79%
Decision Tree – Bagging             78.23%      66.07%           14.29%
Decision Tree – Random Forest       79.59%      66.07%           12.09%
Decision Tree – Boosting            82.99%      69.64%            8.79%
Linear Discriminant Analysis        80.95%      66.07%            9.89%
Quadratic Discriminant Analysis     81.63%      76.79%           15.38%

Page 17

Conclusion

• Logistic regression is the best model

• It has high accuracy, reasonably high sensitivity, and a low false positive rate


                               Observed – No heart disease   Observed – Heart disease
Predicted – No heart disease            83 cases                      15 cases
Predicted – Heart disease                8 cases                      41 cases
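The reported logistic regression metrics follow directly from these confusion matrix counts; a small Python check for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and false positive rate from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return accuracy, sensitivity, fpr

# Counts from the confusion matrix above: TP=41, FP=8, TN=83, FN=15
acc, sens, fpr = classification_metrics(41, 8, 83, 15)
print(round(acc * 100, 2), round(sens * 100, 2), round(fpr * 100, 2))
# -> 84.35 73.21 8.79, matching the reported logistic regression row
```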

Page 18

ROC Curve for Optimizing Logistic Regression Threshold


Page 19

Logistic Regression Model

• By using a threshold of 0.29, we are able to boost the sensitivity of the model by about 7 percentage points while decreasing the accuracy by only about 6 percentage points.

19

Logistic Regression Model   Accuracy   Sensitivity   False Positive Rate
Threshold = 0.5              84.35%      73.21%            8.79%
Threshold = 0.29             78.23%      80.36%           23.08%
Difference                   -6.12%      +7.15%          +14.29%
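Applying the tuned threshold is a one-line change at prediction time; a minimal Python sketch for illustration:

```python
def predict_with_threshold(probs, threshold=0.29):
    """Convert logistic regression probabilities to class labels; lowering the
    threshold from 0.5 to 0.29 trades some accuracy for higher sensitivity."""
    return [1 if p >= threshold else 0 for p in probs]

print(predict_with_threshold([0.10, 0.35, 0.60], threshold=0.29))  # -> [0, 1, 1]
print(predict_with_threshold([0.10, 0.35, 0.60], threshold=0.5))   # -> [0, 0, 1]
```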

Page 20

Final Model

• The final model is the logistic regression

  P(heart disease) = 1 / (1 + e^-(b0 + b1·x1 + ... + b10·x10))

  where the xi's are variables that can be summarized as:

• Patient demographic information
• “sex” (patient gender)
• “age” (patient age)

• Physical health
• “chol” (cholesterol level)
• “fbs” (indicating high blood sugar)

• Exercise ECG findings
• “oldpeak” (exercise-induced ST depression relative to rest)
• “relrest” (indicating relief after rest)

• Exercise-related indications
• “thalach” (maximum heart rate)
• “thalrest” (resting heart rate)
• “pro” and “prop” (indicating ECG measurement specifications)
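A sketch of how the final model could score a patient in an app, written in Python for illustration. The coefficient values below are placeholders, not the fitted values (those live in the delivered R code):

```python
import math

# Placeholder coefficients for the ten selected variables; the real fitted
# values come from the delivered R model, not from this sketch.
COEFFICIENTS = {"sex": 0.0, "age": 0.0, "chol": 0.0, "fbs": 0.0,
                "oldpeak": 0.0, "relrest": 0.0, "thalach": 0.0,
                "thalrest": 0.0, "pro": 0.0, "prop": 0.0}
INTERCEPT = 0.0

def predict_probability(patient):
    """P(heart disease) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(COEFFICIENTS[k] * patient[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))
```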


Page 21

Prototype of App


To access the application, please navigate to the following address: www.tinyurl.com/hdprototype

Page 22

Roles and Responsibilities

• KNN & Boosting - Ravi & Eugenia
• Bagging & Random Forests - Zijian & Jiayang
• Logistic & LDA/QDA - Shijie & Armen

• Final model runs and app design - Ravi
• Missing value imputation and final QA - Zijian & Jiayang
• Final report writing - Armen, Eugenia, Shijie
