

Alexander Simpson, Matthew Murphy, Diana Rubio Mendoza, Ching Hsuan Hsu
11/9/16

BUS 443
Data Mining Case

Problem 1: Heart Clustering

● Cluster 3 was the youngest at death and comprises the people with the highest levels of smoking.

● The parallel coordinates plot indicates that the highest concentrations of people who died at older ages are in Clusters 0 and 2. Compared with Cluster 2, however, Cluster 0 has lower cholesterol levels, a lower frequency of smoking, and lower systolic blood pressure, and its members are significantly taller and heavier. Clusters 0 and 1 have nearly identical cholesterol levels.

● Cluster 1: A medium level of smoking, a relatively even distribution of people across the age groups 50 and up, and low cholesterol and blood pressure.

● Cluster 3: Heavy smokers who died at younger ages.

● Cluster 0: Taller and heavier older people with low cholesterol and high blood pressure, who are mostly non-smokers.

● Cluster 2: Older people with high cholesterol and higher blood pressure, who are shorter in height and lighter in weight, and have low levels of smoking.


● Cluster 2 is the most different and has the highest Within-Cluster SS.

Problem 2


● The best model is obtained when k = 20.
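Choosing k by validation error, as in the bullet above, can be sketched as follows. This is only a minimal illustration and not the report's actual workflow: the tool used in the report is not shown, and the toy points, the labels, and the function names `knn_predict`, `validation_error`, and `best_k` are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    # train is a list of (features, label) pairs; features are numeric tuples.
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def validation_error(train, valid, k):
    """Fraction of validation points that k-NN misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

def best_k(train, valid, candidates):
    """Pick the candidate k with the lowest validation error (first one on ties)."""
    return min(candidates, key=lambda k: validation_error(train, valid, k))

# Hypothetical toy data: 1 = approved, 0 = rejected.
# The point (1.04, 1.0) is a deliberately noisy "rejected" case inside the approved group.
train = [
    ((1.0, 1.0), 1), ((1.2, 0.9), 1), ((0.9, 1.1), 1),
    ((3.0, 3.0), 0), ((3.1, 2.9), 0), ((1.04, 1.0), 0),
]
valid = [((1.05, 1.0), 1), ((3.0, 3.1), 0)]

for k in (1, 3, 5):
    print(k, validation_error(train, valid, k))
print("best k:", best_k(train, valid, [1, 3, 5]))
```

With k = 1 the noisy neighbor decides the vote on its own, while larger k values outvote it. The same trade-off is what a search over k on real validation data is balancing.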


● Error% for the “1” class = 0%; Error% for the “0” class = 33.3%. It is a good model, since the error% for the whole data set is only 10%.

● Looking at the information for the misclassified customer, we can see why the model predicted approval: he did well on home ownership, years of credit history, and revolving balance. However, his credit score was quite low, so the bank actually rejected him.

● a. In the test lift chart, the model does not improve the prediction over the first 4 observations. After 4 observations, it improves the prediction only slightly.
  b. The lift for the first decile is 1.43. If we selected a customer at random, the probability of approval would be 0.7 on average. If we instead use k-NN with the best k = 20 to identify the top 1 customer most likely to be approved, then (1.43)*(0.7) ≈ 1, so that top-ranked customer is an approved one.
  c. The ROC chart has a steep initial slope and levels off quickly, and AUC = 0.83, so the model discriminates well between the classes and does not misclassify observations very often.

Page 6: montecarlorisksimulation.files.wordpress.com  · Web viewLooking at the tree map, I noticed that weight seemed to have little to do with the deaths so I removed it. I added in Metropolitan

Alexander Simpson, Matthew Murphy, Diana Rubio Mendoza, Ching Hsuan Hsu11/9/16

BUS 443

● The first, third, and sixth customers will receive approval. They all own houses and have high credit scores, long credit histories, and low revolving utilization.
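The per-class error rates and the first-decile lift reported earlier in this problem can be checked with a short calculation. The sketch below assumes a test set of 10 customers, 7 actually approved and 3 actually rejected. That size is not stated in the report; it is inferred from the reported rates (33.3% = 1/3, 10% = 1/10, and a base approval rate of 0.7), so the label lists are hypothetical.

```python
# Hypothetical labels consistent with the reported rates (1 = approved, 0 = rejected).
# One actually rejected customer is predicted as approved.
actual    = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

def class_error(actual, predicted, cls):
    """Share of observations whose actual class is cls but were predicted otherwise."""
    idx = [i for i, a in enumerate(actual) if a == cls]
    wrong = sum(predicted[i] != cls for i in idx)
    return wrong / len(idx)

error_1 = class_error(actual, predicted, 1)   # error for the "1" class
error_0 = class_error(actual, predicted, 0)   # error for the "0" class
overall_error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)

# First-decile lift: approval rate among the top 10% of ranked customers
# divided by the overall approval rate. With 10 customers the first decile
# is a single customer, assumed here to be an approved one.
base_rate = sum(actual) / len(actual)
top_decile_rate = 1 / 1
lift = top_decile_rate / base_rate

print(f"error(1) = {error_1:.1%}, error(0) = {error_0:.1%}, overall = {overall_error:.1%}")
print(f"base rate = {base_rate:.2f}, first-decile lift = {lift:.2f}")
```

Under these assumptions the single misclassified customer accounts for both the 33.3% error in the “0” class and the 10% overall error, and a lift of 1.43 is the largest value possible when the base rate is 0.7.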

Problem #3: Predicting The PVA Donors using Logistic Regression


Upon grouping by the SE Cluster Codes, the SE Cluster Code with the highest R Square value was Cluster Code #52, with an R Square value of 0.3752. On the lift chart, the best lift is 4.6154, at the 20th percentile. The ROC chart plots how the true positive rate changes as the false positive rate changes. Based on the chart, our Max Separation (KS Statistic) is 0.6381. The Misclassification plot shows how many observations were correctly and incorrectly classified for each value of the response variable. In the Donated category, there was a frequency of 407 false negatives (incorrect) and a frequency of 633 true positives (correct). However, once we change the prediction cutoff value from 0.50 to 0.25, the results improve considerably: the 0.25 cutoff yields 240 false negatives and a much higher 800 true positives. Therefore, the better prediction cutoff value for this scenario is 0.25. Lastly, the Response Profile tab displays the following counts for donors vs. non-donors: Donate 1040, Not Donate 3760.
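The effect of the cutoff change can be quantified from the counts reported above. The sketch below uses only those counts; the function name `true_positive_rate` is illustrative. The report does not give false positive counts for either cutoff, so only the donor side of the comparison is computed here.

```python
def true_positive_rate(tp, fn):
    """Share of actual donors that the model classifies as donors."""
    return tp / (tp + fn)

donors, non_donors = 1040, 3760            # Response Profile counts
tpr_050 = true_positive_rate(633, 407)     # cutoff 0.50
tpr_025 = true_positive_rate(800, 240)     # cutoff 0.25

# Base donation rate, and the largest lift any model could reach,
# which occurs when every selected observation is a donor.
base_rate = donors / (donors + non_donors)
max_lift = 1 / base_rate

print(f"TPR at 0.50 = {tpr_050:.4f}, TPR at 0.25 = {tpr_025:.4f}")
print(f"base rate = {base_rate:.4f}, maximum possible lift = {max_lift:.4f}")
```

Lowering the cutoff raises the share of donors caught from about 61% to about 77%. The maximum possible lift works out to 4800/1040 ≈ 4.6154, which equals the best lift read from the chart.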


Problem #4: Building a Decision Tree Model for Heart Disease Prediction

Using the variables Systolic, Diastolic, Weight, Smoking, and Cholesterol, I found Coronary Heart Disease to be only partially explained. The ROC graph was not very vertical at all, so I knew something could be added to improve the model. Looking at the tree map, I noticed that Weight seemed to have little to do with the deaths, so I removed it. I added Metropolitan Relative Weight, and it improved the model greatly. Age when diagnosed with CHD would seem to have a strong correlation with the deaths, since the earlier the disease is found, the easier it would be to prevent heart disease. Upon adding this variable, I immediately saw a much more vertical ROC graph. Taking out Cholesterol and adding Age at Death also produced a significant improvement over the previous model. Of the models I tried, this combination of variables most accurately explains deaths from Coronary Heart Disease. Using this model, there was only one false positive and only 288 false negatives.
