
Analysis Report Presentation 041515 - Team 4


Page 1: Analysis Report Presentation 041515 - Team 4

ANALYSIS ON STATISTICAL MODEL THAT BEST PREDICTS HEART DISEASE

Prepared for Healthy Living, Inc.

Prepared by Team 4, LLC

April 15, 2015

Page 2

Objective

• To provide a statistical model that predicts heart disease and can be implemented in a consumer-facing app

• To present our methodologies and model results

• To provide our recommendation on which model should be used

• To provide necessary R code to implement the model


Page 3

Overall Process


Page 4

Data Preparation and Variable Selection

• The raw data set (“heart disease data_hungarian.xlsx”) includes 76 variables and 294 observations

• The response variable for predicting heart disease is binary, with a value of 1 for positive (heart disease) and 0 for negative (no heart disease)

• We recoded a prior variable (originally called “num”), which included values 1 through 4, to a value of 1 in the new response variable

• We need to cut down the number of variables while still predicting heart disease efficiently
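The recoding step above can be sketched as follows. This is an illustrative Python sketch (the deck's deliverable code was written in R), and the function name is our own:

```python
def recode_response(num_values):
    """Recode the original "num" variable: 0 stays 0 (no heart disease);
    values 1 through 4 all become 1 (heart disease present)."""
    return [0 if v == 0 else 1 for v in num_values]

print(recode_response([0, 2, 4, 1, 0]))  # -> [0, 1, 1, 1, 0]
```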


Page 5

Process to Reduce Number of Variables

• Remove variables with no relevant information
• Thirty-two variables were removed because they were empty, irrelevant, lacked any information in the data dictionary, or represented the month, day, and year

• Remove variables with small variability
• Some variables had a single value for almost all observations

• Remove variables with multicollinearity
• Nine variables were removed for high multicollinearity, using a collinearity cutoff of 0.6

• Remove variables with low correlation with the outcome
• A minimum absolute correlation threshold of 10 percent removed four additional variables that were not related to the outcome
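The filtering passes above can be sketched roughly as follows. This is an illustrative Python sketch of the same logic (the actual deliverable was R code); the function name and the greedy pair-dropping order are our own assumptions, and the variability step here only drops strictly constant columns:

```python
import numpy as np

def reduce_variables(X, y, names, collin_cut=0.6, outcome_cut=0.10):
    """Filter the columns of X (n x p) in three passes: drop constant columns,
    drop one column of each highly collinear pair (|r| > collin_cut), then
    drop columns weakly correlated with the outcome (|r| < outcome_cut)."""
    keep = [j for j in range(X.shape[1]) if np.std(X[:, j]) > 0]
    dropped = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            i, j = keep[a], keep[b]
            if i in dropped or j in dropped:
                continue
            if abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) > collin_cut:
                dropped.add(j)                  # greedily keep the earlier column
    keep = [j for j in keep if j not in dropped]
    keep = [j for j in keep if abs(np.corrcoef(X[:, j], y)[0, 1]) >= outcome_cut]
    return [names[j] for j in keep]
```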


Page 6

Monte Carlo Data Imputation

• To maximize the power of our tests, we use a Monte Carlo technique for data imputation.

• We examine the distribution of each variable and generate random numbers from these distributions to fill in the missing data.

• Compared with filling in a single value such as the mean, this approach preserves each variable's variability and generally leads to less biased results.
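A minimal sketch of this style of imputation, assuming each missing entry is filled by a random draw from the column's observed (empirical) distribution. Written in Python for illustration (the deliverable was R code), with a fixed seed for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def monte_carlo_impute(column):
    """Fill missing values (NaN) by drawing random samples from the empirical
    distribution of the observed values in the same column."""
    column = np.asarray(column, dtype=float).copy()
    observed = column[~np.isnan(column)]
    missing = np.isnan(column)
    column[missing] = rng.choice(observed, size=int(missing.sum()))
    return column
```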


Page 7

Validation Set

We were originally provided separate training and testing sets. Upon investigating the distribution of the response variable, we found that the number of testing observations not classified as heart disease is very small:

                                              Training Data Set   Testing Data Set
Total number of observations                         294                123
Patients with heart disease (% of total)          106 (36%)          115 (93%)
Patients without heart disease (% of total)       188 (64%)            8 (7%)

Page 8

Validation Set

• The two data sets come from fundamentally different populations


Page 9

Proposed Models

• The following models were considered in the analysis:

• K-Nearest Neighbors

• Logistic Regression

• Linear and Quadratic Discriminant Analysis

• Decision Trees: Bagging

• Decision Trees: Random Forests

• Decision Trees: Boosting


Page 10

Analysis Results – K-Nearest Neighbors (KNN)

• K = 13 performed best

• The KNN method achieves the lowest false positive rate (6.6%) of all the methods considered.
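For illustration, a self-contained sketch of the KNN classification rule with K = 13 (majority vote over the 13 nearest training points under Euclidean distance). The actual analysis was done in R; the tie-breaking toward the positive class here is an arbitrary choice of ours:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=13):
    """Classify a new point by majority vote among its k nearest training
    points under Euclidean distance (K = 13 per the deck's tuning)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    positive_votes = int(y_train[nearest].sum())
    return 1 if 2 * positive_votes >= k else 0  # ties go to the positive class
```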


Page 11

Logistic Regression

• Our variable selection technique quickly isolates the variables that most affect the outcome: we chose the variables most correlated with the outcome.

• This was done by running a model with the single most correlated variable, then with the 2 most correlated variables, then with the top 3, and so on, comparing the accuracy of each model.

• We found that a model with the 10 most correlated variables has the best accuracy, at 84%.
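The ranking step behind this search can be sketched as follows; nested logistic models would then be fitted on the top 1, 2, ..., k variables and compared on accuracy. Python for illustration (the deliverable was R code), and the function name is ours:

```python
import numpy as np

def rank_by_outcome_correlation(X, y, names):
    """Order candidate variables by |correlation with the outcome|, highest
    first; nested models are then fitted on the top 1, top 2, ... variables."""
    strengths = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(strengths)[::-1]
    return [names[j] for j in order]
```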

Page 12

Decision Tree – Bagging

• The chart shows the mean decrease in Gini index and the mean decrease in accuracy for each variable, relative to the largest; by these measures, the most important variables are oldpeak and relrest.

• While the results give insight into the most impactful variables, bagging does not perform well compared to the other methods.

Page 13

Decision Tree – Random Forests

• The number of variables randomly sampled at each split was set to 4

• This is based on the commonly used heuristic for classification with random forests, mtry ≈ √p, where p is the number of features in the data set

• While random forests improved the prediction results compared to bagging, they ultimately did not beat the other models.
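The heuristic can be written out directly; with the deck's choice of 4 sampled variables this corresponds to roughly 16 candidate features, though the exact feature count is not stated. A Python sketch for illustration:

```python
import math

def default_mtry(p):
    """Common default for classification random forests: sample about the
    square root of p candidate variables at each split."""
    return max(1, math.floor(math.sqrt(p)))

print(default_mtry(16))  # -> 4
```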


Page 14

Decision Tree – Boosting

• Boosting performs best when using 14 trees, as shown by the graph

• Of all the decision tree methods, boosting achieved the best values, but they were not as good as those of the other models.
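For illustration, a toy boosting sketch in Python using AdaBoost with one-feature threshold stumps. The deck's model was presumably fitted with an R boosting package, so treat this only as a sketch of the general technique, not the team's implementation:

```python
import numpy as np

def adaboost_stumps(X, y, n_trees=14):
    """Fit a toy AdaBoost ensemble of one-feature threshold stumps
    (the deck found that 14 trees worked best for its boosting model)."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)                      # observation weights
    ys = np.where(y == 1, 1.0, -1.0)             # labels in {-1, +1}
    stumps = []
    for _ in range(n_trees):
        best = None                              # (error, feature, threshold, sign)
        for j in range(p):
            for t in np.unique(X[:, j]):
                for sign in (1.0, -1.0):
                    pred = np.where(X[:, j] >= t, sign, -sign)
                    err = w[pred != ys].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        err, j, t, sign = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * np.log((1.0 - err) / err)  # stump weight
        pred = np.where(X[:, j] >= t, sign, -sign)
        w = w * np.exp(-alpha * ys * pred)       # upweight misclassified points
        w = w / w.sum()
        stumps.append((alpha, j, t, sign))
    return stumps

def boost_predict(stumps, X):
    """Predict 0/1 labels from the weighted vote of the stumps."""
    score = sum(a * np.where(X[:, j] >= t, s, -s) for a, j, t, s in stumps)
    return (score > 0).astype(int)
```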


Page 15

Linear Discriminant Analysis and Quadratic Discriminant Analysis

• The LDA method had decent results, but did not beat logistic regression on any of the relevant metrics (accuracy, sensitivity, false positive rate)

• QDA achieved the highest sensitivity of any method, almost 77%, but at the cost of a higher false positive rate.


Page 16

Model Summary

Model                              Accuracy   Sensitivity   False Positive Rate
K-Nearest Neighbors                 80.95%      60.71%            6.59%
Logistic Regression                 84.35%      73.21%            8.79%
Decision Tree – Bagging             78.23%      66.07%           14.29%
Decision Tree – Random Forest       79.59%      66.07%           12.09%
Decision Tree – Boosting            82.99%      69.64%            8.79%
Linear Discriminant Analysis        80.95%      66.07%            9.89%
Quadratic Discriminant Analysis     81.63%      76.79%           15.38%

Page 17

Conclusion

• Logistic regression is the best model

• It has high accuracy, reasonably high sensitivity, and a low false positive rate


                               Observed – No heart disease   Observed – Heart disease
Predicted – No heart disease            83 cases                      15 cases
Predicted – Heart disease                8 cases                      41 cases
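The reported logistic regression metrics follow directly from these confusion matrix counts; a small Python check for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and false positive rate from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return accuracy, sensitivity, fpr

# Counts from the confusion matrix above: TP=41, FP=8, TN=83, FN=15
acc, sens, fpr = classification_metrics(41, 8, 83, 15)
print(round(acc * 100, 2), round(sens * 100, 2), round(fpr * 100, 2))
# -> 84.35 73.21 8.79, matching the reported logistic regression row
```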

Page 18

ROC Curve for Optimizing Logistic Regression Threshold


Page 19

Logistic Regression Model

• By using a threshold of 0.29, we are able to boost the sensitivity of the model by about 7 percentage points while decreasing the accuracy by only about 6 percentage points.

19

Logistic Regression Model   Accuracy   Sensitivity   False Positive Rate
Threshold = 0.5              84.35%      73.21%            8.79%
Threshold = 0.29             78.23%      80.36%           23.08%
Difference                   -6.12%      +7.15%          +14.29%
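Applying the tuned threshold is a one-line change at prediction time; a minimal Python sketch for illustration:

```python
def predict_with_threshold(probs, threshold=0.29):
    """Convert logistic regression probabilities to class labels; lowering the
    threshold from 0.5 to 0.29 trades some accuracy for higher sensitivity."""
    return [1 if p >= threshold else 0 for p in probs]

print(predict_with_threshold([0.10, 0.35, 0.60], threshold=0.29))  # -> [0, 1, 1]
print(predict_with_threshold([0.10, 0.35, 0.60], threshold=0.5))   # -> [0, 0, 1]
```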

Page 20

Final Model

• The final model is the logistic regression

  P(heart disease) = 1 / (1 + e^-(b0 + b1·x1 + ... + b10·x10))

  where the xi's are variables that can be summarized as:

• Patient demographic information
• “sex” (patient gender)
• “age” (patient age)

• Physical health
• “chol” (cholesterol level)
• “fbs” (indicating high blood sugar)

• Exercise ECG findings
• “oldpeak” (exercise-induced ST depression relative to rest)
• “relrest” (indicating relief after rest)

• Exercise-related indications
• “thalach” (maximum heart rate)
• “thalrest” (resting heart rate)
• “pro” and “prop” (indicating ECG measurement specifications)
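A sketch of how the final model could score a patient in an app, written in Python for illustration. The coefficient values below are placeholders, not the fitted values (those live in the delivered R code):

```python
import math

# Placeholder coefficients for the ten selected variables; the real fitted
# values come from the delivered R model, not from this sketch.
COEFFICIENTS = {"sex": 0.0, "age": 0.0, "chol": 0.0, "fbs": 0.0,
                "oldpeak": 0.0, "relrest": 0.0, "thalach": 0.0,
                "thalrest": 0.0, "pro": 0.0, "prop": 0.0}
INTERCEPT = 0.0

def predict_probability(patient):
    """P(heart disease) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(COEFFICIENTS[k] * patient[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))
```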


Page 21

Prototype of App


To access the application, please navigate to the following address: www.tinyurl.com/hdprototype

Page 22

Roles and Responsibilities

• KNN & Boosting - Ravi & Eugenia
• Bagging & Random Forests - Zijian & Jiayang
• Logistic & LDA/QDA - Shijie & Armen

• Final model runs and app design - Ravi
• Missing value imputation and final QA - Zijian & Jiayang
• Final report writing - Armen, Eugenia, Shijie
