Predicting Wine Quality Using Different Implementations of Decision Tree Algorithm in R
MOHAMMED ALHAMADI - PROJECT 1
Acknowledgement
This project was done as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, Tandon School of Engineering, NYU.
Outline
1. Data set
2. Data exploration and visualization
3. Factorizing a variable
4. Splitting data into training and testing
5. Using (C50) Library
6. Using (Tree) library
7. Using (rpart) library
8. Results Comparison
Data Set
• The data set contains 4898 observations on white wine varieties, with quality ranked by wine tasters
• The data set contains 11 independent variables and 1 dependent variable
• The independent variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol
• The dependent variable is the quality of the wine, ranked from 3 (lowest quality) to 9 (highest quality)
Data Exploration and Visualization
wine_data <- read.csv("C:/Users/Mohammed/Google Drive/R_code/Project1/winequality-white.csv", header=TRUE, sep=";")
dim(wine_data)
[1] 4898 12
names(wine_data)
[1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"
[5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"
[9] "pH"                   "sulphates"            "alcohol"              "quality"
Data Exploration and Visualization (cont.)
str(wine_data)

'data.frame': 4898 obs. of 12 variables:
 $ fixed.acidity       : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num 1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int 6 6 6 6 6 6 6 6 6 6 ...
Data Exploration and Visualization (cont.)
cor(wine_data)
Fixed acid Vol. acid Citric acid Res.Sugar Chlorides FS dioxide TS dioxide Density pH Sulphates Alcohol Quality
Fixed acid 1 -0.02 0.29 0.09 0.02 -0.05 0.1 0.27 -0.43 -0.02 -0.12 -0.11
Vol. acid -0.02 1 -0.15 0.06 0.07 -0.1 0.09 0.03 -0.03 -0.04 0.07 -0.19
Citric acid 0.29 -0.15 1 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08 -0.01
Res.Sugar 0.09 0.06 0.09 1 0.09 0.3 0.4 0.83 -0.2 -0.02 -0.05 -0.1
Chlorides 0.02 0.07 0.11 0.09 1 0.1 0.2 0.26 -0.1 0.02 -0.36 -0.21
FS dioxide -0.05 -0.1 0.09 0.3 0.1 1 0.62 0.3 -0.01 0.01 -0.25 0.01
TS dioxide 0.1 0.09 0.12 0.4 0.2 0.62 1 0.53 0.01 0.13 -0.45 -0.17
Density 0.27 0.03 0.15 0.83 0.26 0.3 0.53 1 -0.1 0.07 -0.8 -0.31
pH -0.43 -0.03 -0.16 -0.2 -0.1 -0.01 0.01 -0.1 1 0.16 0.12 0.1
Sulphates -0.02 -0.04 0.06 -0.02 0.02 0.01 0.13 0.07 0.16 1 -0.02 0.05
Alcohol -0.12 0.07 -0.08 -0.05 -0.36 -0.25 -0.45 -0.8 0.12 -0.02 1 0.44
Quality -0.11 -0.19 -0.01 -0.1 -0.21 0.01 -0.17 -0.31 0.1 0.05 0.44 1
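The strongest relationships in the matrix above (e.g. density vs. residual sugar at 0.83, and density vs. alcohol at -0.8) can also be pulled out programmatically. A minimal sketch, assuming it is run before the factor column is added (so `wine_data` is still all-numeric):

```r
# List each variable pair whose correlation exceeds 0.5 in absolute value,
# keeping each pair only once by blanking the upper triangle and diagonal
cm <- cor(wine_data)
cm[upper.tri(cm, diag = TRUE)] <- NA
strong <- which(abs(cm) > 0.5, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = cm[strong])
```

This flags the density/residual-sugar, density/alcohol, and free/total sulfur dioxide pairs, which is useful to know before fitting models, since tree-based methods tolerate correlated predictors but the splits they choose among them can be arbitrary.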
Data Exploration and Visualization (cont.)
hist(wine_data$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1)
hist(wine_data$density, col="#BCEE6B", main="Histogram of Wine Density", xlab="Density", ylab="Number of samples", las=1)
hist(wine_data$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1)
Data Exploration and Visualization (cont.)
hist(wine_data$quality, col="#458B74", main="Wine Quality Histogram", xlab="Quality", ylab="Number of samples")
Data Exploration and Visualization (cont.)
• Explore the exact type of each column in the data
typeof(wine_data$fixed.acidity)
typeof(wine_data$volatile.acidity)
typeof(wine_data$citric.acid)
typeof(wine_data$residual.sugar)
typeof(wine_data$chlorides)
typeof(wine_data$free.sulfur.dioxide)
typeof(wine_data$total.sulfur.dioxide)
typeof(wine_data$density)
typeof(wine_data$pH)
typeof(wine_data$sulphates)
typeof(wine_data$alcohol)
typeof(wine_data$quality)
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "double"
[1] "integer"
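The twelve separate typeof() calls above can be collapsed into one vectorized call; a minimal sketch using base R's sapply:

```r
# Apply typeof() over every column at once; returns a named character vector
sapply(wine_data, typeof)
```

The result names each column with its storage type: "double" for the eleven measurement columns and "integer" for quality, matching the outputs shown above.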
Factorizing a variable
• Frequency of each quality level:
• 45% of the scores are at score 6
• We want a two-level categorical variable (High or Low), so there are two natural ways to split the scores
• For a better score distribution, we choose low scores to be from 1 to 5 and high scores from 6 to 9
table(wine_data$quality)
3 4 5 6 7 8 9
20 163 1457 2198 880 175 5
Option 1: High = scores 6 to 9 (67%), Low = scores 1 to 5 (33%)
Option 2: High = scores 7 to 9 (22%), Low = scores 1 to 6 (78%)
Factorizing a variable (cont.)
quality_fac <- ifelse(wine_data$quality >= 6, "high", "low")
wine_data <- data.frame(wine_data, quality_fac)
table(wine_data$quality_fac)
high  low
3258 1640
• We can now remove the old integer quality variable
wine_data <- wine_data[,-12]
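One caveat: ifelse() returns a character vector, and since R 4.0 data.frame() no longer converts strings to factors by default, while C5.0, tree, and rpart's "class" method all expect a factor response. It is safest to convert explicitly; a minimal sketch:

```r
# ifelse() produced a character column; tree-based classifiers need a factor.
# stringsAsFactors defaults to FALSE from R 4.0 onward, so convert explicitly.
wine_data$quality_fac <- factor(wine_data$quality_fac)
str(wine_data$quality_fac)   # Factor w/ 2 levels "high","low"
```

On the R 3.x versions current when this project was done, data.frame() performed this conversion automatically, which is why the original code works as written there.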
Splitting data into training and testing
set.seed(71)
training_size <- round(0.8 * dim(wine_data)[1])
training_sample <- sample(dim(wine_data)[1], training_size, replace=FALSE)
training_data <- wine_data[training_sample,]
testing_data <- wine_data[-training_sample,]

• We set a seed so that the results are reproducible
• 80% of the data set is used for training
• 20% is used for testing
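The simple random split above can leave the high/low proportions slightly unbalanced between the two sets. A stratified alternative, sketched here with the caret package (an assumption, not used in the original slides):

```r
library(caret)  # assumed available; install.packages("caret") if not
set.seed(71)
# createDataPartition samples within each class, preserving the high/low ratio
idx <- createDataPartition(wine_data$quality_fac, p = 0.8, list = FALSE)
training_data <- wine_data[idx, ]
testing_data  <- wine_data[-idx, ]
```

With 67%/33% classes and almost 5000 rows, a plain random split is usually close enough, so this is a refinement rather than a correction.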
Using (C50) Library
library(C50)
C50_model <- C5.0(quality_fac~., data=training_data)
predict_C50 <- predict(C50_model, testing_data[,-12])
testing_high <- testing_data$quality_fac
# misclassification error
mean(predict_C50 != testing_high)
[1] 0.2265306
• So the misclassification error for this model is almost 23%
Using (C50) Library (cont.)
library(ROCR)
predict_C50_num <- as.numeric(predict_C50)
actual_num <- as.numeric(testing_data$quality_fac)
pr <- prediction(predict_C50_num, actual_num)
auc_data1 <- performance(pr, "tpr", "fpr")
plot(auc_data1, main="ROC Curve for C50 Model")
Using (C50) Library (cont.)
aucval1 <- performance(pr, measure="auc")
aucval1@y.values[[1]] # area under the curve value
[1] 0.7444854
• So, the area under the curve value for the C50 model = 0.7444854
Using (Tree) Library
library(tree)
tree_model <- tree(quality_fac~., data=training_data)
predict_tree <- predict(tree_model, testing_data[,-12], type="class")
mean(predict_tree != testing_high)
[1] 0.2663265
• So the misclassification error for the tree model is almost 27%
Using (Tree) Library (cont.)
plot(tree_model)
text(tree_model, pretty=0)
Using (Tree) Library (cont.)
predict_tree_num <- as.numeric(predict_tree)
pr2 <- prediction(predict_tree_num, actual_num)
auc_data2 <- performance(pr2, "tpr", "fpr")
plot(auc_data2, main="ROC Curve for Tree Model")
Using (Tree) Library (cont.)
aucval2 <- performance(pr2, measure="auc")
aucval2@y.values[[1]]
[1] 0.6439793
• So, the area under the curve value for the tree model = 0.6439793
Using (rpart) library
library(rpart)
rpart_model <- rpart(quality_fac~., data=training_data, method="class")
predict_rpart <- predict(rpart_model, testing_data[,-12], type="class")
mean(predict_rpart != testing_high)
[1] 0.2428571
• So the misclassification error for the rpart model is almost 24%
Using (rpart) library (cont.)
library(rpart.plot)
rpart.plot(rpart_model, extra=101)
• We can plot the tree and show the correctly and incorrectly classified instances
Using (rpart) library (cont.)
predict_rpart_num <- as.numeric(predict_rpart)
pr3 <- prediction(predict_rpart_num, actual_num)
auc_data3 <- performance(pr3, "tpr", "fpr")
plot(auc_data3, main="ROC Curve for RPART Model")
Using (rpart) library (cont.)
aucval3 <- performance(pr3, measure="auc")
aucval3@y.values[[1]]
[1] 0.7118481
• So, the area under the curve value for the rpart model = 0.7118481
Results Comparison

C50 Model:
table(testing_high, predicted=predict_C50)

          Predicted
Testing   High  Low
   High    545  112
   Low     110  213

• 758 correctly classified (77%)
• 222 incorrectly classified (23%)
• TPR (Sensitivity) = 545/657 = 83%
• FPR (Fall-out) = 110/323 = 34%

Tree Model:
table(testing_high, predicted=predict_tree)

          Predicted
Testing   High  Low
   High    596   61
   Low     200  123

• 719 correctly classified (73%)
• 261 incorrectly classified (27%)
• TPR (Sensitivity) = 596/657 = 91%
• FPR (Fall-out) = 200/323 = 62%

RPART Model:
table(testing_high, predicted=predict_rpart)

          Predicted
Testing   High  Low
   High    555  102
   Low     136  187

• 742 correctly classified (76%)
• 238 incorrectly classified (24%)
• TPR (Sensitivity) = 555/657 = 84%
• FPR (Fall-out) = 136/323 = 42%
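The accuracy, TPR, and FPR figures above can be computed from each confusion matrix rather than by hand; a minimal sketch for the C50 model (the same idea works for the other two), treating "high" as the positive class:

```r
# Confusion matrix: rows = actual, columns = predicted
cm <- table(actual = testing_high, predicted = predict_C50)
accuracy <- sum(diag(cm)) / sum(cm)             # correctly classified share
tpr <- cm["high", "high"] / sum(cm["high", ])   # sensitivity
fpr <- cm["low",  "high"] / sum(cm["low",  ])   # fall-out
c(accuracy = accuracy, TPR = tpr, FPR = fpr)
```

Indexing by the level names "high" and "low" keeps the calculation correct regardless of the order in which table() arranges the rows and columns.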
Results Comparison (cont.)
C50 Model: Area Under Curve = 0.7444854
Tree Model: Area Under Curve = 0.6439793
RPART Model: Area Under Curve = 0.7118481
References
o Wine Quality Dataset
o DSO 530: Decision Trees in R (Classification)
o Analysis of Wine Quality Data
o Scatterplots
o Tree Based Models
o R - Classification Trees (part 2 using rpart)
o Information retrieval – Wikipedia
o What does AUC stand for and what is it?