Transcript
Page 1:

Data Science and Big Data Analytics

Chap 7: Adv Analytical Theory and Methods: Classification

Charles Tappert, Seidenberg School of CSIS, Pace University

Page 2:

Chapter Sections

7.1 Decision Trees
7.2 Naïve Bayes
7.3 Diagnostics of Classifiers
7.4 Additional Classification Models
Summary

Page 3:

7 Classification

Classification is widely used for prediction

Most classification methods are supervised

This chapter focuses on two fundamental classification methods:
Decision trees
Naïve Bayes

Page 4:

7.1 Decision Trees

Tree structure specifies a sequence of decisions
Given input X = {x1, x2, …, xn}, predict output Y
Input attributes/features can be categorical or continuous
Node = tests a particular input variable
Root node, internal nodes, leaf nodes; leaf nodes return class labels
Depth of node = minimum number of steps to reach the node
Branch (connects two nodes) = specifies a decision
Two varieties of decision trees:
Classification trees: categorical output, often binary
Regression trees: numeric output

Page 5:

7.1 Decision Trees
7.1.1 Overview of a Decision Tree

Example of a decision tree
Predicts whether customers will buy a product

Page 6:

7.1 Decision Trees
7.1.1 Overview of a Decision Tree

Example: will a bank client subscribe to a term deposit?

Page 7:

7.1 Decision Trees
7.1.2 The General Algorithm

Construct a tree T from training set S
Requires a measure of attribute information
Simplistic method (data from the previous figure):
Purity = probability of the corresponding class
E.g., P(no) = 1789/2000 = 89.45%, P(yes) = 10.55%
Entropy methods:
Entropy measures the impurity of an attribute
Information gain measures the purity of an attribute

Page 8:

7.1 Decision Trees
7.1.2 The General Algorithm

Entropy methods of attribute information
HX = the entropy of X
Information gain of an attribute = base entropy - conditional entropy
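The formulas appeared as images in the original slide; reconstructed in the standard form consistent with these definitions:

H_X = -\sum_{x} P(X = x) \log_2 P(X = x)

H_{Y|X} = \sum_{x} P(X = x)\, H_{Y|X=x}  (conditional entropy)

\mathrm{InfoGain}(X) = H_Y - H_{Y|X}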

Page 9:

7.1 Decision Trees
7.1.2 The General Algorithm

Construct a tree T from training set S
Choose root node = most informative attribute A (see the sketch below)
Partition S according to A's values
Construct subtrees T1, T2, … for the subsets of S recursively until one of the following occurs:
All leaf nodes satisfy the minimum purity threshold
The tree cannot be further split within the minimum purity threshold
Another stopping criterion is satisfied, e.g., maximum depth
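A minimal R sketch (not from the book) of scoring attributes by information gain, assuming a data frame like the banktrain used later in the deck, with a categorical output column named subscribed:

entropy <- function(y) {
  p <- table(y) / length(y)   # class proportions
  p <- p[p > 0]               # drop empty classes so 0*log(0) is not NaN
  -sum(p * log2(p))           # H = -sum_x p(x) log2 p(x)
}
info_gain <- function(S, attr, out = "subscribed") {
  base <- entropy(S[[out]])   # base entropy H_Y
  cond <- sum(sapply(split(S, S[[attr]]), function(part)
    nrow(part) / nrow(S) * entropy(part[[out]])))  # conditional entropy H_{Y|A}
  base - cond                 # information gain
}
# pick the most informative attribute as the root, e.g.:
# which.max(sapply(c("job", "marital", "education"), info_gain, S = banktrain))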

Page 10:

7.1 Decision Trees
7.1.3 Decision Tree Algorithms

ID3 Algorithm (T = training set, P = output variable, A = attribute)
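The algorithm body was a figure in the original slide; in outline, the standard ID3 recursion is:
1. If all records in T have the same value of P, return a leaf node with that value.
2. If A is empty, return a leaf node with the most common value of P in T.
3. Otherwise, pick the attribute in A with the largest information gain, split T on its values, and recurse on each subset with that attribute removed.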

Page 11:

7.1 Decision Trees
7.1.3 Decision Tree Algorithms

C4.5 Algorithm
Handles missing data
Handles both categorical and continuous variables
Uses bottom-up pruning to address overfitting
CART (Classification And Regression Trees)
Also handles continuous variables
Uses the Gini diversity index as the information measure (see the sketch below)
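In the rpart package used in Section 7.1.5, the split criterion selects between these two measures; a brief sketch (rpart's documented parms interface, assuming library(rpart) and the banktrain data frame from that section):

fit_gini <- rpart(subscribed ~ ., data=banktrain, method="class",
                  parms=list(split="gini"))         # CART-style Gini index (rpart's default)
fit_info <- rpart(subscribed ~ ., data=banktrain, method="class",
                  parms=list(split="information"))  # entropy-based information gain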

Page 12:

7.1 Decision Trees
7.1.4 Evaluating a Decision Tree

Decision trees are greedy algorithms
Best option at each step, but maybe not best overall
Addressed by ensemble methods, e.g., random forest
Model might overfit the data
[Plot: error vs. tree size; blue = training set, red = test set]
Overcoming overfitting:
Stop growing the tree early
Grow the full tree, then prune (see the sketch below)
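A minimal post-pruning sketch with rpart (assuming the fit object from Section 7.1.5; cptable and prune are rpart's documented interface):

printcp(fit)   # cross-validated error (xerror) per complexity parameter (CP)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)   # cut back splits that do not reduce cv error
rpart.plot(pruned)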

Page 13:

7.1 Decision Trees
7.1.4 Evaluating a Decision Tree

Decision trees -> rectangular decision regions

Page 14:

7.1 Decision Trees
7.1.4 Evaluating a Decision Tree

Advantages of decision trees:
Computationally inexpensive
Outputs are easy to interpret: a sequence of tests
Show the importance of each input variable
Decision trees handle:
Both numerical and categorical attributes
Categorical attributes with many distinct values
Variables with a nonlinear effect on the outcome
Variable interactions

Page 15:

7.1 Decision Trees
7.1.4 Evaluating a Decision Tree

Disadvantages of decision trees:
Sensitive to small variations in the training data
Overfitting can occur because each split reduces the training data for subsequent splits
Poor if the dataset contains many irrelevant variables

Page 16:

7.1 Decision Trees
7.1.5 Decision Trees in R

# install packages rpart, rpart.plot
# put this code into RStudio source and execute lines via Ctrl/Enter
library("rpart")
library("rpart.plot")
setwd("c:/data/rstudiofiles/")
banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
## drop a few columns to simplify the tree
drops <- c("age", "balance", "day", "campaign", "pdays", "previous", "month")
banktrain <- banktrain[, !(names(banktrain) %in% drops)]
summary(banktrain)
# Make a simple decision tree by only keeping the categorical variables
fit <- rpart(subscribed ~ job + marital + education + default + housing +
             loan + contact + poutcome,
             method="class", data=banktrain,
             control=rpart.control(minsplit=1),
             parms=list(split='information'))
summary(fit)
# Plot the tree
rpart.plot(fit, type=4, extra=2, clip.right.labs=FALSE, varlen=0, faclen=3)

Page 17:

7.2 Naïve Bayes

The naïve Bayes classifier:
Based on Bayes' theorem (or Bayes' law)
Assumes the features contribute independently
Features (variables) are generally categorical
Discretization is the process of converting continuous variables into categorical ones
Output is usually a class label plus a probability score
Log probability is often used instead of the probability

Page 18:

7.2 Naïve Bayes
7.2.1 Bayes' Theorem

Bayes' Theorem, where C = class and A = observed attributes:
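The equation itself appeared as an image in the original slide; in standard form:

P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}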

Typical medical example: used because doctors frequently get this wrong

Page 19:

7.2 Naïve Bayes
7.2.2 Naïve Bayes Classifier

Conditional independence assumption
Dropping the common denominator, we get
Find the cj that maximizes P(cj|A)
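The two equations were images in the original slide; in the standard naïve Bayes form, with A = {a1, …, am}:

P(A \mid c_j) = \prod_{i=1}^{m} P(a_i \mid c_j)

P(c_j \mid A) \propto P(c_j) \prod_{i=1}^{m} P(a_i \mid c_j), \quad j = 1, 2, \ldots, n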

Page 20:

7.2 Naïve Bayes
7.2.2 Naïve Bayes Classifier

Example: will the client subscribe to the term deposit?
The following record is from a bank client. Is this client likely to subscribe to the term deposit?

Page 21:

7.2 Naïve Bayes
7.2.2 Naïve Bayes Classifier

Compute the probabilities for this record

Page 22:

7.2 Naïve Bayes
7.2.2 Naïve Bayes Classifier

Compute the naïve Bayes classifier outputs: yes/no
The client is assigned the label subscribed = yes
The scores are small, but the ratio is what counts
Using logarithms helps avoid numerical underflow (see the sketch below)
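A minimal R sketch of the log trick; the prior comes from the 2000-record training set shown earlier, but the conditional probabilities below are hypothetical placeholders, not the slide's actual table entries:

prior_yes   <- 0.1055                      # P(yes) from the training set
p_given_yes <- c(0.25, 0.05, 0.35, 0.60)   # hypothetical P(a_i | yes) values
score     <- prior_yes * prod(p_given_yes)            # direct product can underflow with many attributes
log_score <- log(prior_yes) + sum(log(p_given_yes))   # a sum of logs does not underflow
exp(log_score)   # equals score up to rounding; in practice, compare log scores across classes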

Page 23:

7.2 Naïve Bayes
7.2.3 Smoothing

A smoothing technique assigns a small nonzero probability to rare events that are missing from the training data
E.g., Laplace smoothing assumes every output occurs once more than it occurs in the dataset
Smoothing is essential: without it, a single zero conditional probability forces P(cj|A) = 0
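In its usual add-one form (a standard formulation, not transcribed from the slide):

P^{*}(a_i \mid c_j) = \frac{\mathrm{count}(a_i, c_j) + 1}{\mathrm{count}(c_j) + m}

where m is the number of distinct values the attribute can take.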

Page 24:

7.2 Naïve Bayes
7.2.4 Diagnostics

Naïve Bayes advantages:
Handles missing values
Robust to irrelevant variables
Simple to implement
Computationally efficient
Handles high-dimensional data efficiently
Often competitive with other learning algorithms
Reasonably resistant to overfitting

Naïve Bayes disadvantages:
Assumes variables are conditionally independent
Therefore, sensitive to double counting correlated variables
In its simplest form, used only for categorical variables

Page 25:

7.2 Naïve Bayes
7.2.5 Naïve Bayes in R

This section explores two methods of using the naïve Bayes classifier:
Manually compute the probabilities from scratch: tedious, with many R calculations
Use the naiveBayes function from the e1071 package: much easier (starts on page 222)

Example: subscribing to a term deposit

Page 26:

7.2 Naïve Bayes
7.2.5 Naïve Bayes in R

Get the data and the e1071 package:

> setwd("c:/data/rstudio/chapter07")
> sample <- read.table("sample1.csv", header=TRUE, sep=",")
> traindata <- as.data.frame(sample[1:14,])
> testdata <- as.data.frame(sample[15,])
> traindata   # lists the training data
> testdata    # lists the test data; no Enrolls variable
> install.packages("e1071", dep = TRUE)
> library(e1071)   # contains the naiveBayes function

Page 27:

7.2 Naïve Bayes
7.2.5 Naïve Bayes in R

Perform modeling

> model <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata)
> model     # generates model output
> results <- predict(model, testdata)
> results   # provides the test prediction (lowercase: R is case sensitive)

Using a Laplace parameter gives the same result
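For reference, e1071's naiveBayes exposes this through its laplace argument; a one-line sketch:

> model_lap <- naiveBayes(Enrolls ~ Age + Income + JobSatisfaction + Desire, traindata, laplace = 1)
> predict(model_lap, testdata)   # same predicted label as above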

Page 28:

7.3 Diagnostics of Classifiers

The book has covered three classifiers:
Logistic regression, decision trees, naïve Bayes
Tools to evaluate classifier performance:
Confusion matrix (standard layout shown below)
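The matrix itself was a figure on the slide; its standard layout for a binary classifier:

                Predicted yes          Predicted no
Actual yes      TP (true positive)     FN (false negative)
Actual no       FP (false positive)    TN (true negative)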

Page 29:

7.3 Diagnostics of Classifiers

Bank marketing example:
Training set of 2000 records
Test set of 100 records, evaluated below

Page 30:

7.3 Diagnostics of Classifiers

Evaluation metrics
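The metric definitions were shown as equations in the original slide; the standard forms are:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\mathrm{TPR} = \frac{TP}{TP + FN} \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}

\mathrm{FNR} = \frac{FN}{TP + FN} \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}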

Page 31:

7.3 Diagnostics of Classifiers

Evaluation metrics on the bank marketing 100-record test set
[Table of metric values from the slide; two of the metrics are annotated "poor"]

Page 32:

7.3 Diagnostics of Classifiers

ROC curve: good for evaluating binary detection

Bank marketing: 2000-record training set + 100-record test set

> library(e1071)   # provides naiveBayes
> library(ROCR)    # provides prediction() and performance() for ROC analysis
> banktrain <- read.table("bank-sample.csv", header=TRUE, sep=",")
> drops <- c("balance", "day", "campaign", "pdays", "previous", "month")
> banktrain <- banktrain[, !(names(banktrain) %in% drops)]
> banktest <- read.table("bank-sample-test.csv", header=TRUE, sep=",")
> banktest <- banktest[, !(names(banktest) %in% drops)]
> nb_model <- naiveBayes(subscribed ~ ., data=banktrain)
> nb_prediction <- predict(nb_model, banktest[, -ncol(banktest)], type='raw')
> score <- nb_prediction[, c("yes")]
> actual_class <- banktest$subscribed == 'yes'
> pred <- prediction(score, actual_class)  # the slide flagged a "code problem" here; prediction() is from ROCR, so the package must be loaded (added above)
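A short continuation that actually draws the curve (assumed from ROCR's documented interface; not shown on the slide):

> perf <- performance(pred, measure="tpr", x.measure="fpr")
> plot(perf)   # ROC curve: true positive rate vs. false positive rate
> performance(pred, measure="auc")@y.values[[1]]   # area under the ROC curve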

Page 33:

7.3 Diagnostics of Classifiers

ROC curve: good for evaluating binary detection
Bank marketing: 2000-record training set + 100-record test set

Page 34:

7.4 Additional Classification Methods

Ensemble methods that use multiple models:
Bagging: a bootstrap method that uses repeated sampling with replacement
Boosting: similar to bagging, but an iterative procedure
Random forest: uses an ensemble of decision trees
These models usually perform better than a single decision tree (see the sketch after this list)
Support Vector Machine (SVM):
A linear model using a small number of support vectors
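A minimal random forest sketch (assuming the randomForest package and the banktrain data frame from Section 7.1.5; not from the book's code):

> library(randomForest)
> banktrain$subscribed <- as.factor(banktrain$subscribed)   # classification requires a factor response
> rf_model <- randomForest(subscribed ~ ., data=banktrain, ntree=500)
> rf_model   # prints the out-of-bag (OOB) error estimate, an internal test-set proxy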

Page 35:

Summary

How to choose a suitable classifier among:
Decision trees, naïve Bayes, and logistic regression

Page 36:

Midterm Exam – 10/28/15, 6:10-9:00 (2 hours, 50 minutes)

30% - Clustering: k-means example
30% - Association Rules: store transactions
30% - Regression: simple linear example
10% - Ten multiple-choice questions
Note: for each of the three main problems:
Manually compute the algorithm on a small example
Complete short-answer sub-questions

