
Data Science and Big Data Analytics

Chapter 7: Advanced Analytical Theory and Methods: Classification

Charles Tappert
Seidenberg School of CSIS, Pace University

Chapter Sections

7.1 Decision Trees
7.2 Naïve Bayes
7.3 Diagnostics of Classifiers
7.4 Additional Classification Methods
Summary

7 Classification

Classification is widely used for prediction

Most classification methods are supervised

This chapter focuses on two fundamental classification methods: decision trees and naïve Bayes

7.1 Decision Trees

Tree structure specifies a sequence of decisions
Given input X = {x1, x2, ..., xn}, predict output Y
Input attributes/features can be categorical or continuous
Node = tests a particular input variable
Root node, internal nodes; leaf nodes return class labels
Depth of a node = minimum number of steps required to reach the node
Branch (connects two nodes) = specifies a decision
Two varieties of decision trees
  Classification trees: categorical output, often binary
  Regression trees: numeric output

7.1.1 Overview of a Decision Tree

Example of a decision tree: predicts whether customers will buy a product

7.1.1 Overview of a Decision Tree

Example: will a bank client subscribe to a term deposit?

7.1.2 The General Algorithm

Construct a tree T from training set S
Requires a measure of attribute information
Simplistic method (data from the previous figure)
  Purity = probability of the corresponding class
  E.g., P(no) = 1789/2000 = 89.45%, P(yes) = 211/2000 = 10.55%
Entropy methods
  Entropy measures the impurity of an attribute
  Information gain measures the purity of an attribute

7.1.2 The General Algorithm

Entropy methods of attribute information
  H_X = the entropy of X
  Information gain of an attribute = base entropy minus conditional entropy (see the formulas below)
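The slide's equations are not reproduced in this transcript; for reference, the standard definitions (in LaTeX notation), with Y the output variable and X a candidate attribute, are:

H_X = -\sum_{x} P(X = x)\,\log_2 P(X = x)

H_{Y \mid X} = \sum_{x} P(X = x)\, H_{Y \mid X = x}

\mathrm{InfoGain}(X) = H_Y - H_{Y \mid X}   (base entropy minus conditional entropy)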

7.1.2 The General Algorithm

Construct a tree T from training set S
  Choose the root node = the most informative attribute A (see the sketch below)
  Partition S according to A's values
  Construct subtrees T1, T2, ... for the subsets of S recursively, until one of the following occurs
    All leaf nodes satisfy a minimum purity threshold
    The tree cannot be further split with the minimum purity threshold
    Another stopping criterion is satisfied, e.g., maximum depth
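A minimal R sketch of the "most informative attribute" step, assuming the banktrain data frame and its subscribed output column from Section 7.1.5 are loaded; this is an illustration, not the book's code:

entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }   # entropy of a probability vector

info_gain <- function(data, attribute, target = "subscribed") {
  base <- entropy(prop.table(table(data[[target]])))          # base entropy of the output
  splits <- split(data, data[[attribute]])                    # partition by the attribute's values
  cond <- sum(sapply(splits, function(s)
    nrow(s) / nrow(data) * entropy(prop.table(table(s[[target]])))))  # weighted conditional entropy
  base - cond
}

# Rank a few candidate attributes by information gain, e.g.:
# sapply(c("job", "marital", "education", "poutcome"), function(a) info_gain(banktrain, a))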

7.1.3 Decision Tree Algorithms

ID3 algorithm: T = training set, P = output variable, A = attribute

7.1.3 Decision Tree Algorithms

C4.5 algorithm
  Handles missing data
  Handles both categorical and continuous variables
  Uses bottom-up pruning to address overfitting
CART (Classification And Regression Trees)
  Also handles continuous variables
  Uses the Gini diversity index as the information measure (see below)
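For reference (not shown on the slide), the Gini diversity index for a node with class proportions p_1, ..., p_k is, in LaTeX notation:

\mathrm{Gini} = 1 - \sum_{j=1}^{k} p_j^{\,2}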

7.1.4 Evaluating a Decision Tree

Decision trees are greedy algorithms
  Best option at each step, maybe not best overall
  Addressed by ensemble methods such as random forest
Model might overfit the data (figure legend: blue = training set, red = test set)
Overcoming overfitting
  Stop growing the tree early
  Grow the full tree, then prune

7.1.4 Evaluating a Decision Tree

Decision trees -> rectangular decision regions

7.1.4 Evaluating a Decision Tree

Advantages of decision trees
  Computationally inexpensive
  Outputs are easy to interpret: a sequence of tests
  Show the importance of each input variable
Decision trees handle
  Both numerical and categorical attributes
  Categorical attributes with many distinct values
  Variables with nonlinear effects on the outcome
  Variable interactions

7.1.4 Evaluating a Decision Tree

Disadvantages of decision trees
  Sensitive to small variations in the training data
  Overfitting can occur because each split reduces the training data available for subsequent splits
  Perform poorly if the dataset contains many irrelevant variables

7.1.5 Decision Trees in R

# install packages rpart, rpart.plot
# put this code into the RStudio source pane and execute lines via Ctrl+Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")

## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)

# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing +
               loan + contact + poutcome,
             method="class", data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)

# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)
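A possible follow-up, not on the slide: score the training data with the fitted tree and compare predicted and actual classes (a sketch assuming the fit and banktrain objects created above):

pred_class <- predict(fit, banktrain, type = "class")          # class labels predicted by the rpart tree
table(predicted = pred_class, actual = banktrain$subscribed)   # simple confusion table on the training data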

7.2 Naïve Bayes

The naïve Bayes classifier
  Based on Bayes' theorem (or Bayes' law)
  Assumes the features contribute independently
  Features (variables) are generally categorical
  Discretization is the process of converting continuous variables into categorical ones

Output is usually class label plus probability score

Log probability often used instead of probability

7.2.1 Bayes' Theorem

Bayes' theorem (stated below), where C = class and A = observed attributes
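The slide's formula is not reproduced; in its standard form (LaTeX notation):

P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}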

Typical medical example
  Used because doctors frequently get this wrong

7.2.2 Naïve Bayes Classifier

Conditional independence assumption
Dropping the common denominator gives the naïve Bayes score (see the formulas below)
Find the class cj that maximizes P(cj|A)
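The equations this slide refers to, in their standard form (LaTeX notation), with attributes A = (a_1, ..., a_m) and classes c_1, ..., c_n:

Conditional independence: P(a_1, \ldots, a_m \mid c_j) = \prod_{i=1}^{m} P(a_i \mid c_j)

After dropping the common denominator P(A): P(c_j \mid A) \propto P(c_j) \prod_{i=1}^{m} P(a_i \mid c_j)

Classifier output: \hat{c} = \arg\max_{c_j} P(c_j) \prod_{i=1}^{m} P(a_i \mid c_j)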

7.2.2 Naïve Bayes Classifier

Example: will the client subscribe to the term deposit?

The following record is from a bank client. Is this client likely to subscribe to the term deposit?

7.2.2 Naïve Bayes Classifier

Compute probabilities for this record

7.2.2 Naïve Bayes Classifier

Compute Naïve Bayes classifier outputs: yes/no

The client is assigned the label subscribed = yes

The scores are small, but the ratio is what counts

Using logarithms helps avoid numerical underflow
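In log space the product becomes a sum, which avoids underflow when many small probabilities are multiplied; the classes are compared by

\log P(c_j) + \sum_{i=1}^{m} \log P(a_i \mid c_j)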

7.2.3 Smoothing

A smoothing technique assigns a small nonzero probability to rare events that are missing in the training data
  E.g., Laplace smoothing assumes every outcome occurs once more than it actually occurs in the dataset (one common form is shown below)
Smoothing is essential: without it, a single zero conditional probability makes P(cj|A) = 0
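One common form of Laplace (add-one) smoothing, assuming attribute a_i has k possible values, is (LaTeX notation):

P^{*}(a_i = v \mid c_j) = \frac{\mathrm{count}(a_i = v,\, c_j) + 1}{\mathrm{count}(c_j) + k}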

7.2.4 Diagnostics

Naïve Bayes advantages
  Handles missing values
  Robust to irrelevant variables
  Simple to implement
  Computationally efficient
  Handles high-dimensional data efficiently
  Often competitive with other learning algorithms
  Reasonably resistant to overfitting
Naïve Bayes disadvantages
  Assumes variables are conditionally independent
    Therefore, sensitive to double counting correlated variables
  In its simplest form, used only for categorical variables

7.2.5 Naïve Bayes in R

This section explores two methods of using the naïve Bayes classifier
  Manually compute probabilities from scratch: tedious, with many R calculations
  Use the naiveBayes function from the e1071 package: much easier (starts on page 222 of the text)

Example: subscribing to term deposit

7.2.5 Naïve Bayes in R

Get the data and the e1071 package

> setwd("c:/data/rstudio/chapter07")
> sample <- read.table("sample1.csv", header=TRUE, sep=",")
> traindata <- as.data.frame(sample[1:14,])
> testdata <- as.data.frame(sample[15,])
> traindata                              # lists the training data
> testdata                               # lists the test data; no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071)                         # contains the naiveBayes function

7.2.5 Naïve Bayes in R

Perform modeling

> model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
> model                                  # generates the model output
> results <- predict(model, testdata)
> results                                # provides the test prediction

Using a Laplace parameter gives same result
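A sketch of the same call with explicit Laplace smoothing; the e1071 naiveBayes function accepts a laplace argument (the name model_laplace is mine):

> model_laplace <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata, laplace = 1)
> predict(model_laplace, testdata)       # same predicted class as above, per the slide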

7.3 Diagnostics of Classifiers

The book covered three classifiers: logistic regression, decision trees, and naïve Bayes
Tools to evaluate classifier performance
  Confusion matrix
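The matrix itself is not shown in this transcript; the standard 2x2 layout for a binary classifier is:

                   Predicted positive    Predicted negative
  Actual positive  TP (true positive)    FN (false negative)
  Actual negative  FP (false positive)   TN (true negative)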

7.3 Diagnostics of Classifiers

Bank marketing example
  Training set of 2,000 records
  Test set of 100 records, evaluated below

7.3 Diagnostics of Classifiers

Evaluation metrics
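The slide's own list of formulas is not reproduced; commonly used evaluation metrics for a binary classifier include (LaTeX notation):

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}

\mathrm{FNR} = \frac{FN}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}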

7.3 Diagnostics of Classifiers

Evaluation metrics on the bank marketing 100-record test set
  (results table not reproduced; two of the metrics are annotated as poor)

7.3 Diagnostics of Classifiers

ROC curve: good for evaluating binary detection
Bank marketing: 2,000-record training set + 100-record test set

> library(ROCR)       # prediction() and performance() come from the ROCR package
> # assumes library(e1071) is already loaded (Section 7.2.5)
> banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
> drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
> banktrain <- banktrain[, !(names(banktrain) %in% drops)]
> banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
> banktest <- banktest[, !(names(banktest) %in% drops)]
> nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
> nb_prediction <- predict(nb_model, banktest[, -ncol(banktest)], type='raw')
> score <- nb_prediction[, c("yes")]
> actual_class <- banktest$subscribed == 'yes'
> pred <- prediction(score, actual_class)
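A possible continuation (a sketch using the pred object above and ROCR's performance() function; the names perf and auc are mine): plot the ROC curve and report the area under it:

> perf <- performance(pred, "tpr", "fpr")   # true positive rate vs. false positive rate
> plot(perf, lwd = 2)                       # draws the ROC curve
> auc <- performance(pred, "auc")
> auc@y.values[[1]]                         # area under the ROC curve (AUC)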

7.3 Diagnostics of Classifiers

ROC curve: good for evaluating binary detection
Bank marketing: 2,000-record training set + 100-record test set (ROC plot not reproduced)

7.4 Additional Classification Methods

Ensemble methods that use multiple models
  Bagging: a bootstrap method that uses repeated sampling with replacement
  Boosting: similar to bagging, but an iterative procedure
  Random forest: uses an ensemble of decision trees (see the sketch after this list)
  These models usually have better performance than a single decision tree
Support Vector Machine (SVM)
  Linear model using a small number of support vectors
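A minimal sketch, not the book's code, of fitting a random forest to the same bank data; it assumes banktrain and banktest are loaded as in Section 7.3, that the randomForest package is installed, and uses my own object names rf_model and rf_pred:

# install.packages("randomForest")
library(randomForest)

# ensure categorical columns are factors (read.table may leave them as character)
banktrain[] <- lapply(banktrain, function(x) if (is.character(x)) as.factor(x) else x)
banktest[]  <- lapply(banktest,  function(x) if (is.character(x)) as.factor(x) else x)

rf_model <- randomForest(subscribed ~ ., data = banktrain, ntree = 500)  # ensemble of 500 trees
rf_pred  <- predict(rf_model, banktest)                                  # majority-vote class predictions
table(predicted = rf_pred, actual = banktest$subscribed)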

Summary

How to choose a suitable classifier among decision trees, naïve Bayes, and logistic regression

Midterm Exam: 10/28/15, 6:10-9:00 (2 hours, 50 minutes)

30% - Clustering: k-means example
30% - Association Rules: store transactions
30% - Regression: simple linear example
10% - Ten multiple-choice questions
Note: for each of the three main problems
  Manually compute the algorithm on a small example
  Complete short-answer sub-questions

Recommended