Data Science and Big Data Analytics
Chapter 7: Advanced Analytical Theory and Methods: Classification
Charles Tappert
Seidenberg School of CSIS, Pace University
Chapter Sections
7.1 Decision Trees
7.2 Naïve Bayes
7.3 Diagnostics of Classifiers
7.4 Additional Classification Methods
Summary
7 Classification
Classification is widely used for prediction
Most classification methods are supervised
This chapter focuses on two fundamental classification methods:
  Decision trees
  Naïve Bayes
7.1 Decision Trees
Tree structure specifies a sequence of decisions
Given input X = {x1, x2, ..., xn}, predict output Y
Input attributes/features can be categorical or continuous
Node = tests a particular input variable
  Root node, internal nodes; leaf nodes return class labels
Depth of a node = minimum steps to reach the node
Branch (connects two nodes) = specifies a decision
Two varieties of decision trees:
  Classification trees: categorical output, often binary
  Regression trees: numeric output
7.1 Decision Trees - 7.1.1 Overview of a Decision Tree
Example of a decision tree that predicts whether customers will buy a product
7.1 Decision Trees - 7.1.1 Overview of a Decision Tree
Example: will a bank client subscribe to a term deposit?
7.1 Decision Trees - 7.1.2 The General Algorithm
Construct a tree T from training set S
Requires a measure of attribute information
Simplistic method (data from the previous figure):
  Purity = probability of the corresponding class
  E.g., P(no) = 1789/2000 = 89.45%, P(yes) = 10.55%
Entropy methods:
  Entropy measures the impurity of an attribute
  Information gain measures the purity of an attribute
7.1 Decision Trees - 7.1.2 The General Algorithm
Entropy methods of attribute information
  Entropy of X: H(X) = -Σx P(X=x) log2 P(X=x)
  Conditional entropy of output Y given attribute X: H(Y|X) = Σx P(X=x) H(Y|X=x)
Information gain of an attribute = base entropy minus conditional entropy:
  Gain(Y, X) = H(Y) - H(Y|X)
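As an illustration, these quantities can be computed directly in base R; this is a minimal sketch, not one of the book's listings, and the variable names are made up:

# Minimal sketch of entropy and information gain in base R (illustrative only)
entropy <- function(y) {
  p <- table(y) / length(y)            # class proportions P(Y=y)
  -sum(p * log2(p))                    # H(Y) = -sum p log2 p
}
info_gain <- function(y, x) {
  h_cond <- sum(sapply(split(y, x),    # weighted entropy of each partition of y by x
                       function(s) length(s) / length(y) * entropy(s)))
  entropy(y) - h_cond                  # gain = H(Y) - H(Y|X)
}
# Made-up example: output y, attribute x
y <- c("no", "no", "yes", "no", "yes", "no")
x <- c("a",  "a",  "b",   "b",  "b",   "a")
info_gain(y, x)                        # how much x reduces uncertainty about y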
7.1 Decision Trees - 7.1.2 The General Algorithm
Construct a tree T from training set S:
  Choose root node = the most informative attribute A
  Partition S according to A's values
  Construct subtrees T1, T2, ... for the subsets of S recursively until one of the following occurs:
    All leaf nodes satisfy a minimum purity threshold
    The tree cannot be further split with the minimum purity threshold
    Another stopping criterion is satisfied, e.g., maximum depth
7.1 Decision Trees - 7.1.3 Decision Tree Algorithms
ID3 Algorithm (T = training set, P = output variable, A = set of attributes); see the sketch below
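A toy recursive version of ID3 in R, reusing the entropy() and info_gain() helpers sketched above; this is illustrative only, not the book's listing:

# Recursive ID3 sketch: returns a class label or a node that splits on the best attribute
id3 <- function(data, target, attrs) {
  y <- data[[target]]
  if (length(unique(y)) == 1)                  # T is pure: return its class label
    return(as.character(y[1]))
  if (length(attrs) == 0)                      # A is empty: return the majority class
    return(names(which.max(table(y))))
  gains <- sapply(attrs, function(a) info_gain(y, data[[a]]))
  best  <- attrs[which.max(gains)]             # most informative attribute
  node  <- list(attribute = best, children = list())
  for (v in unique(data[[best]])) {            # partition S on the chosen attribute
    part <- data[data[[best]] == v, , drop = FALSE]
    node$children[[as.character(v)]] <- id3(part, target, setdiff(attrs, best))
  }
  node
}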
7.1 Decision Trees - 7.1.3 Decision Tree Algorithms
C4.5 Algorithm
  Handles missing data
  Handles both categorical and continuous variables
  Uses bottom-up pruning to address overfitting
CART (Classification And Regression Trees)
  Also handles continuous variables
  Uses the Gini diversity index as its information measure (see the sketch below)
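For comparison with entropy, the Gini diversity index that CART uses can be computed as follows; a minimal illustrative sketch:

# Gini diversity index: 1 minus the sum of squared class proportions (illustrative)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
gini(c("no", "no", "yes", "no"))   # 0.375; 0 would mean a pure node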
7.1 Decision Trees - 7.1.4 Evaluating a Decision Tree
Decision trees are greedy algorithms
  They take the best option at each step, which may not be best overall
  Addressed by ensemble methods such as random forest
The model might overfit the data (figure: error curves, blue = training set, red = test set)
Overcoming overfitting:
  Stop growing the tree early
  Grow the full tree, then prune
7.1 Decision Trees - 7.1.4 Evaluating a Decision Tree
Decision trees -> rectangular decision regions
7.1 Decision Trees - 7.1.4 Evaluating a Decision Tree
Advantages of decision trees:
  Computationally inexpensive
  Outputs are easy to interpret: a sequence of tests
  Show the importance of each input variable
Decision trees handle:
  Both numerical and categorical attributes
  Categorical attributes with many distinct values
  Variables with nonlinear effect on the outcome
  Variable interactions
7.1 Decision Trees - 7.1.4 Evaluating a Decision Tree
Disadvantages of decision trees:
  Sensitive to small variations in the training data
  Overfitting can occur because each split reduces the training data for subsequent splits
  Perform poorly if the dataset contains many irrelevant variables
7.1 Decision Trees - 7.1.5 Decision Trees in R

# install packages rpart, rpart.plot
# put this code into an RStudio source file and execute lines via Ctrl/Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
# drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing + loan + contact + poutcome,
             method="class",
             data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)
# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
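A possible follow-up, not in the original listing: apply the fitted tree back to the training data with rpart's predict() to see how well it fits:

# Class predictions from the fitted tree, compared with the actual labels
pred <- predict(fit, banktrain, type = "class")
table(pred, banktrain$subscribed)   # training-set confusion matrix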
7.2 Naïve Bayes
The naïve Bayes classifier:
  Based on Bayes' theorem (or Bayes' law)
  Assumes the features contribute independently
  Features (variables) are generally categorical
    Discretization is the process of converting continuous variables into categorical ones
  Output is usually a class label plus a probability score
    Log probability is often used instead of probability
7.2 Naïve Bayes - 7.2.1 Bayes' Theorem
Bayes' theorem:
  P(C|A) = P(A|C) P(C) / P(A)
where C = class, A = observed attributes
Typical medical example, used because doctors frequently get this wrong; a worked sketch follows
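The slide's actual numbers are not in the text; a worked sketch with assumed figures (1% prevalence, 99% sensitivity, 5% false-positive rate) shows the typical surprise:

# Illustrative numbers only, assumed for this sketch
prior <- 0.01                                  # P(disease)
sens  <- 0.99                                  # P(positive | disease)
fpr   <- 0.05                                  # P(positive | no disease)
p_pos <- sens * prior + fpr * (1 - prior)      # P(positive), by total probability
sens * prior / p_pos                           # P(disease | positive) ~ 0.17, not 0.99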
7.2 Naïve Bayes - 7.2.2 Naïve Bayes Classifier
Conditional independence assumption:
  P(a1, a2, ..., am | cj) = P(a1|cj) P(a2|cj) ... P(am|cj)
Dropping the common denominator P(A), we get:
  P(cj | A) ∝ P(a1|cj) P(a2|cj) ... P(am|cj) P(cj)
Find the cj that maximizes P(cj | A)
7.2 Naïve Bayes - 7.2.2 Naïve Bayes Classifier
Example: will the client subscribe to a term deposit?
The following record is from a bank client. Is this client likely to subscribe to the term deposit?
7.2 Naïve Bayes - 7.2.2 Naïve Bayes Classifier
Compute the probabilities for this record
7.2 Naïve Bayes - 7.2.2 Naïve Bayes Classifier
Compute the naïve Bayes classifier outputs: yes/no
The client is assigned the label subscribed = yes
The scores are small, but the ratio is what counts
Using logarithms helps avoid numerical underflow
7.2 Naïve Bayes - 7.2.3 Smoothing
A smoothing technique assigns a small nonzero probability to rare events that are missing from the training data
  E.g., Laplace smoothing assumes every output occurs once more than it actually occurs in the dataset (see the sketch below)
Smoothing is essential: without it, a single zero conditional probability forces P(cj|A) = 0
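A minimal sketch of Laplace (add-one) smoothing on a single conditional probability table, with made-up counts:

# Add 1 to every count so no conditional probability is exactly zero (illustrative)
counts <- c(cellular = 0, telephone = 3, unknown = 7)   # made-up counts for one attribute
counts / sum(counts)                                    # raw: 0.0 0.3 0.7 -- the zero is fatal
(counts + 1) / (sum(counts) + length(counts))           # smoothed: 1/13 4/13 8/13, all nonzero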
7.2 Naïve Bayes - 7.2.4 Diagnostics
Naïve Bayes advantages:
  Handles missing values
  Robust to irrelevant variables
  Simple to implement
  Computationally efficient
  Handles high-dimensional data efficiently
  Often competitive with other learning algorithms
  Reasonably resistant to overfitting
Naïve Bayes disadvantages:
  Assumes variables are conditionally independent
    Therefore sensitive to double counting of correlated variables
  In its simplest form, used only for categorical variables
7.2 Naïve Bayes - 7.2.5 Naïve Bayes in R
This section explores two methods of using the naïve Bayes classifier:
  Manually compute the probabilities from scratch (tedious, with many R calculations)
  Use the naiveBayes function from the e1071 package (much easier; starts on page 222)
Example: subscribing to a term deposit
7.2 Naïve Bayes - 7.2.5 Naïve Bayes in R
Get the data and the e1071 package

> setwd("c:/data/rstudio/chapter07")
> sample <- read.table("sample1.csv", header=TRUE, sep=",")
> traindata <- as.data.frame(sample[1:14,])
> testdata <- as.data.frame(sample[15,])
> traindata   # lists the training data
> testdata    # lists the test data, no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071)   # contains the naiveBayes function
7.2 Naïve Bayes - 7.2.5 Naïve Bayes in R
Perform the modeling

> model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
> model                    # generates the model output
> results <- predict(model, testdata)
> results                  # provides the test prediction

Using a Laplace parameter gives the same result, as shown below
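For reference, naiveBayes() exposes this smoothing through its laplace argument; an illustrative call on the same data:

> model_laplace <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire,
+                             traindata, laplace = 0.01)
> predict(model_laplace, testdata)   # same predicted label as the unsmoothed model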
7.3 Diagnostics of Classifiers
The book covered three classifiers: logistic regression, decision trees, and naïve Bayes
Tools to evaluate classifier performance:
  Confusion matrix
7.3 Diagnostics of Classifiers - Bank marketing example
  Training set of 2000 records
  Test set of 100 records, evaluated below
7.3 Diagnostics of Classifiers - Evaluation metrics
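With TP, TN, FP, and FN denoting the true/false positives/negatives from the confusion matrix, the metrics used here are the standard ones:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
  True positive rate (TPR, recall) = TP / (TP + FN)
  False positive rate (FPR) = FP / (FP + TN)
  False negative rate (FNR) = FN / (TP + FN)
  Precision = TP / (TP + FP)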
7.3 Diagnostics of Classifiers - Evaluation metrics on the bank marketing 100-record test set
(Slide table of metric values; two of the entries were marked poor)
7.3 Diagnostics of Classifiers - ROC curve: good for evaluating binary detection
Bank marketing: 2000-record training set + 100-record test set

> library(e1071)   # naiveBayes()
> library(ROCR)    # prediction() comes from the ROCR package
> banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
> drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
> banktrain <- banktrain[, !(names(banktrain) %in% drops)]
> banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
> banktest <- banktest[, !(names(banktest) %in% drops)]
> nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
> nb_prediction <- predict(nb_model, banktest[, -ncol(banktest)], type='raw')
> score <- nb_prediction[, c("yes")]
> actual_class <- banktest$subscribed == 'yes'
> pred <- prediction(score, actual_class)
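A continuation, not in the original listing, that draws the curve and reports the AUC with ROCR's performance():

> perf <- performance(pred, "tpr", "fpr")   # ROC: TPR vs. FPR at every threshold
> plot(perf, lwd = 2)                       # draw the ROC curve
> abline(a = 0, b = 1, lty = 2)             # diagonal = random guessing
> auc <- performance(pred, "auc")
> auc@y.values[[1]]                         # area under the ROC curve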
7.3 Diagnostics of Classifiers - ROC curve: good for evaluating binary detection
(Figure: ROC curve for the bank marketing example, 2000-record training set + 100-record test set)
7.4 Additional Classification Methods
Ensemble methods that use multiple models (see the sketch below):
  Bagging: a bootstrap method that uses repeated sampling with replacement
  Boosting: similar to bagging, but an iterative procedure
  Random forest: uses an ensemble of decision trees
  These models usually have better performance than a single decision tree
Support Vector Machine (SVM):
  A linear model using a small number of support vectors
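An illustrative sketch of two of these models on the bank data used earlier; the randomForest package and e1071's svm() are assumed choices here, not mandated by the text:

# Illustrative only: random forest and linear SVM on the bank marketing data
library(randomForest)   # bagged ensemble of decision trees
library(e1071)          # svm()

bank <- read.table("bank-sample.csv", header = TRUE, sep = ",",
                   stringsAsFactors = TRUE)    # randomForest needs factor columns

rf_model  <- randomForest(subscribed ~ ., data = bank, ntree = 500)
svm_model <- svm(subscribed ~ ., data = bank, kernel = "linear")

table(predict(rf_model), bank$subscribed)   # out-of-bag confusion for the forest
table(fitted(svm_model), bank$subscribed)   # training confusion for the SVM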
Summary
How to choose a suitable classifier among decision trees, naïve Bayes, and logistic regression
Midterm Exam - 10/28/15, 6:10-9:00 (2 hours, 50 minutes)
  30%: Clustering, k-means example
  30%: Association Rules, store transactions
  30%: Regression, simple linear example
  10%: Ten multiple-choice questions
Note: for each of the three main problems:
  Manually compute the algorithm on a small example
  Complete short-answer sub-questions