The phases into which a classifier's design can be divided
are reflected in the structure of WEKA's Explorer:
◦ Data pre-processing (filtering) and representation
◦ Supervised Learning/Classification, or
◦ Unsupervised Learning / Clustering, or
◦ Rule-based classification
◦ Visualization
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ )
Introduction to Weka
Online course on WEKA (http://weka.waikato.ac.nz)
5 lessons of 6 modules each (7-10 minutes +
exercises): about 15 hours of total engagement
including time for assignments.
Lessons are also accessible on YouTube
Weka
WEKA lets one choose among a huge number of
modules which implement the main steps necessary
to develop a classification/data mining application
(filters, classifiers, clustering algorithms, graphical
analysis of results)
Modules can be accessed:
◦ using a command-line interface
◦ through one of the GUIs provided by Weka
◦ within one's own code, as functions of a Java
package (see the sketch below)
Weka
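For instance, a minimal sketch of the last option, assuming weka.jar is on the classpath (the file name and the particular filter/classifier modules chosen here are illustrative, not prescribed by these slides):

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class WekaModules {
        public static void main(String[] args) throws Exception {
            // Load a data set from an ARFF file (file name is illustrative)
            Instances raw = DataSource.read("mydata.arff");
            // Tell Weka which attribute is the class label (here: the last one)
            raw.setClassIndex(raw.numAttributes() - 1);

            // Pre-processing step: apply one of Weka's filter modules
            Normalize norm = new Normalize();
            norm.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, norm);

            // Classification step: train one of Weka's classifier modules
            Classifier model = new J48();   // a decision-tree classifier
            model.buildClassifier(data);
            System.out.println(model);      // prints the learned model
        }
    }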
WEKA includes three GUIs:
◦ Explorer
Interactive pre-processing, data classification and algorithm configuration
◦ Experimenter
Environment for organizing multiple executions of algorithms (on different data sets or using different random seeds), with final generation of a summary of the results.
◦ Knowledge Flow
GUI for 'visual' building of an application as a cascade of modules.
WEKA also provides a Command-Line Interface from which it is possible to run the different modules as independent applications.
Introduction to Weka
Explorer
◦ Main functions and options
An example
◦ Classification of 2-dimensional data (easy to
visualize!)
Bayes Classifiers
◦ Naïve Bayes
Though it is based on the hypothesis that the features are
independent, which is very rarely true, it often performs
reasonably well and is very often taken as a "baseline
reference" for more sophisticated classifiers
An example
Classify the dataset saved in the files
circletrain.arff // training set
circletest.arff // test set
This is a binary problem: one needs to distinguish between
patterns belonging to 2 classes.
Problem
distinguish the points belonging to:
1. the circle of radius 1, centered at (0,0)
2. the square with sides of length L=2.5,
centered at (0,0), from which region
1 has been removed.
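A possible sketch of this experiment run through Weka's Java API instead of the Explorer (NaiveBayes and Evaluation are standard Weka classes; it is assumed here that the class label is the last attribute in the ARFF files):

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CircleExample {
        public static void main(String[] args) throws Exception {
            // Load training and test sets from the ARFF files
            Instances train = DataSource.read("circletrain.arff");
            Instances test  = DataSource.read("circletest.arff");
            // The class label is assumed to be the last attribute
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Train a naive Bayes classifier on the training set
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);

            // Evaluate it on the independent test set
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);
            System.out.println(eval.toSummaryString()); // accuracy, error rates, ...
            System.out.println(eval.toMatrixString());  // confusion matrix
        }
    }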
Recalling what we said in the previous lesson, the
classical learning scheme based on
training/(validation)/test sets may still provide very
inaccurate estimates of a classifier's performance,
for example because of:
• Too few data
• Data unevenly distributed among classes and
between the training and test sets, with some classes
over- or under-represented
• Presence of outliers
N-fold Crossvalidation
In the circle/square example the training and test sets
have been sampled under the best possible conditions:
• Samples drawn randomly according to a uniform
distribution for both x and y coordinates => x and y
are statistically independent (that is why the naïve
Bayes classifier works reasonably well)
• No noise added to features (x and y)
• No wrong/noisy expected output/label (aka teaching
input)
N-fold Crossvalidation
Still we have
• 53/47 (circle/square) training patterns … and 60/40
(circle/square) testing patterns, versus an expected
ratio of π/(6.25−π) = 3.14/3.11 ≈ 1.01
It is no surprise we obtain different results on the
training and test sets …
And that different random drawings of the training and
test sets also lead to rather different estimates of the
classifier’s performance.
N-fold Crossvalidation
What improvements can we introduce ?
• Stratification: the drawing is made such that the
distribution of samples of different classes is the
same in each set
This solves one problem… but cannot do much when
data are few and different drawings can apparently tell
different stories. In particular, the smaller the sample
size, the stronger the impact a single outlier has on the
estimates.
N-fold Crossvalidation
After all:
• We have just a small data set
• If we use all data for training, we will definitely
overfit the data.
So what can we do if we want to take the most from
our data ?
N-fold Crossvalidation
• Divide the data set into N subsets
• For N times:
• Leave one subset out and use it as the test set,
training a classifier on the other N−1 subsets
• Estimate the performance of the classifier over the
union of the results obtained on the N disjoint test
sets.
Extreme case: ‘leave-one-out’, in which one has as many
folds as samples.
N-fold Crossvalidation
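Weka implements this procedure directly; a minimal sketch using its Java API (the choice of N = 10, the classifier, and the file name are illustrative):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("circletrain.arff"); // illustrative
            data.setClassIndex(data.numAttributes() - 1);

            // N-fold cross-validation (here N = 10): Weka trains N classifiers
            // internally, tests each on its held-out fold, and pools the results
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }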
Observations
• Each sample is used for testing by a classifier that
was not trained on it
• The procedure can be used only to evaluate the
method (estimate a classifier's error, compare it to
others) but not to design a classifier: we have
produced N different classifiers, none of which has
any reason to be preferred to the others.
N-fold Crossvalidation
The decision taken by a binary classifier produces one of
four possible outcomes:
True Positive (TP): decision = 1, true class C = 1
True Negative (TN): decision = 0, true class C = 0
False Positive (FP): decision = 1, true class C = 0
False Negative (FN): decision = 0, true class C = 1
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
Given a data set with Nd patterns to which one applies a
binary classifier, then TP+TN+FP+FN = Nd
Accuracy: 100 · (TP+TN)/(TP+TN+FP+FN) = 100 · (TP+TN)/Nd
Estimate of P(correct decision). Correct decision rate.
Recall / Sensitivity / TP rate: 100 · TP/(TP+FN)
Estimate of P(1|c=1). Correct decision rate when c = 1.
Specificity = 1 − FP rate: 100 · TN/(TN+FP)
Estimate of P(0|c=0). Correct decision rate when c = 0.
Precision / Positive predictivity: 100 · TP/(TP+FP)
Estimate of P(c=1|1). Rate of correct decisions when the decision is 1.
F-Measure = 2 · Precision · Recall / (Precision + Recall)
Global performance index; with Precision and Recall expressed as fractions, F-Measure ∈ [0,1].
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
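As a small worked sketch of these formulas (plain Java; the four counts are illustrative):

    public class BinaryQuality {
        public static void main(String[] args) {
            // Illustrative outcome counts for some binary classifier
            double TP = 90, FN = 10, FP = 20, TN = 80;
            double Nd = TP + TN + FP + FN;

            double accuracy    = 100 * (TP + TN) / Nd;  // correct decision rate
            double recall      = 100 * TP / (TP + FN);  // sensitivity / TP rate
            double specificity = 100 * TN / (TN + FP);  // 1 - FP rate
            double precision   = 100 * TP / (TP + FP);  // positive predictivity
            // F-measure uses fractional precision/recall, so it lies in [0, 1]
            double fMeasure = 2 * (precision / 100) * (recall / 100)
                            / (precision / 100 + recall / 100);

            System.out.printf("Acc=%.1f%% Rec=%.1f%% Spe=%.1f%% Pre=%.1f%% F=%.3f%n",
                              accuracy, recall, specificity, precision, fMeasure);
        }
    }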
The Confusion Matrix (CM) associated with an N-class classifier is a square N×N matrix whose element Aij represents the number (or the frequency, if normalized by the number of samples of class i) of patterns belonging to class i that are classified as belonging to class j.
Properties:
• The CM associated with an ideal classifier is diagonal
• The sum of row i, Σj Aij, equals the number of patterns belonging to class i (or 1, if frequencies are used)
• Using the elements Aij it is possible to compute the quality indices for a binary classifier whose decision is ' i / not i ':
• TPi = Aii (equal to the Recall for class i, if frequencies are used)
• FNi = Σj Aij (1 − δij), where δij is the Kronecker delta (Homework: derive TNi and FPi)
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
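A plain-Java sketch of the ' i / not i ' reduction, applied to the 4-class confusion matrix shown below (the TNi and FPi lines are one possible solution to the homework, derived from the row, column, and grand totals):

    public class PerClassCounts {
        public static void main(String[] args) {
            // The 4-class confusion matrix below (rows: class, cols: decision)
            int[][] A = { {190, 6, 2, 2}, {16, 356, 8, 20},
                          {1, 1, 48, 0},  {4, 4, 1, 91} };
            int N = A.length, total = 0;
            for (int[] row : A) for (int v : row) total += v;

            for (int i = 0; i < N; i++) {
                int rowSum = 0, colSum = 0;
                for (int j = 0; j < N; j++) { rowSum += A[i][j]; colSum += A[j][i]; }
                int TP = A[i][i];                      // TPi = Aii
                int FN = rowSum - TP;                  // FNi = sum of row i, minus Aii
                int FP = colSum - TP;                  // FPi = sum of column i, minus Aii
                int TN = total - rowSum - colSum + TP; // TNi = everything else
                System.out.printf("class %d: TP=%d FN=%d FP=%d TN=%d%n",
                                  i + 1, TP, FN, FP, TN);
            }
        }
    }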
Confusion Matrix (rows: actual class, columns: decision)

        C1    C2    C3    C4  |  Ni
C1     190     6     2     2  |  200
C2      16   356     8    20  |  400
C3       1     1    48     0  |   50
C4       4     4     1    91  |  100
Confusion Matrix (%) (each row normalized by Ni)

        C1    C2    C3    C4
C1     95%    3%    1%    1%
C2      4%   89%    2%    5%
C3      2%    2%   96%    0%
C4      4%    4%    1%   91%
Confusion Matrix (Binary Classifier) (rows: actual class, columns: decision)

        1    0
1      TP   FN
0      FP   TN

Normalizing row-wise by Np = TP + FN and Nn = FP + TN,
respectively, one obtains the TP (FN, FP, TN) rates.
The general goal for a classifier is to maximize accuracy.
However, that single figure does not tell the whole story.
Only if accuracy ≈ 1 can we be sure that the classifier is
actually good.
As an example, consider a classifier that must recognize the
digit '8' (positive) against all other digits (negatives).
If we have a uniform sample of digits, the priors are
P(n) = 0.9   P(p) = 0.1
So a "classifier" which always decides 0 (negative) has an
accuracy of 90%, but it is definitely not a good classifier!
ROC (receiver operating characteristic) curve
In terms of optimization, designing a classifier is a multi-
objective optimization problem, in which two criteria must be
maximized: Sensitivity (Recall) and Specificity, i.e.,
Maximize TP (Minimize FN)
Minimize FP (Maximize TN)
ROC (receiver operating characteristic) curve
ROC curves are used in signal detection to show the tradeoff
between hit rate (TP rate) and false alarm rate (FP rate) over
noisy channels.
They were defined when radar theory was first developed.
The y axis represents the percentage of true positives in the sample;
the x axis represents the percentage of false positives in the sample.
It is an estimate of the actual ROC curve :
y axis represents P(ŷ=1 |y=1) (Sensitivity/ TP rate)
x axis represents P(ŷ=1 |y=0) (FP rate)
its slope is equal to the likelihood ratio used in the Bayes
decision rule (see Prof. Bononi’s slides, page B13)
ROC (receiver operating characteristic) curve
(R)OC (receiver operating characteristic) curve
What we obtain, formally, is not the ROC but the so-called
OC (operating characteristic) curve.
In fact, there are cases (see Duda-Hart, page …), e.g., with
multi-modal distributions, in which using a single threshold
is not the right way to go for receiver/classifier design…
Still, a curve (which may not even be convex) can always be
built according to the same rules.
Imagine the following processing chain is used:

signal → Preprocessing → V → Thresholding → 0/1

The curve is obtained by plotting the points
(FP rate, TP rate)
one would obtain by sweeping the threshold
from +∞ (ŷ = 0 for all patterns, FP rate = TP rate = 0)
to −∞ (ŷ = 1 for all patterns, FP rate = TP rate = 1)
(R)OC (receiver operating characteristic) curve
(R)OC (receiver operating characteristic) curve
How can we estimate the curve if we cannot "set" the
threshold and we have no explicit model of the classifier's
behaviour?
We can use the results obtained on the test set.
Supposing that the classifier (as most do) outputs, for each
pattern, a value V that is closer to 1 the more likely the
pattern is to be a positive one, one can:
1. Sort {Vi , i = 1, …, Nd} in descending order
2. For i = 1, …, Nd:
- move up 1/Np if yi = 1
- move 1/Nn to the right if yi = 0
(Np and Nn being the numbers of positive and negative
patterns, as defined above).
(R)OC (receiver operating characteristic) curve
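A plain-Java sketch of this construction (the scores and labels are illustrative; Np and Nn denote the numbers of positive and negative patterns, as before):

    import java.util.Arrays;
    import java.util.Comparator;

    public class RocCurve {
        public static void main(String[] args) {
            // Illustrative classifier outputs V and true labels y (1 = positive)
            double[] v = {0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2};
            int[]    y = {1,   1,   0,   1,   0,    1,   0,   0  };

            // 1. Sort pattern indices by descending score V
            Integer[] idx = new Integer[v.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> v[i]).reversed());

            int np = 0, nn = 0;
            for (int label : y) { if (label == 1) np++; else nn++; }

            // 2. Sweep the threshold: up 1/Np on a positive, right 1/Nn on a negative
            double tpRate = 0, fpRate = 0;
            System.out.printf("%.2f %.2f%n", fpRate, tpRate);      // start at (0,0)
            for (int i : idx) {
                if (y[i] == 1) tpRate += 1.0 / np;
                else           fpRate += 1.0 / nn;
                System.out.printf("%.2f %.2f%n", fpRate, tpRate);  // one curve point per pattern
            }
            // The AUC can be accumulated during the same sweep
            // (left as the homework suggested below).
        }
    }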
If we use data from a single test set, we will obtain a jagged
curve; if we use cross-validation the curve should be
smoother.
(R)OC (receiver operating characteristic) curve
Using (for instance) logistic regression, the likelihood
of each observation is the corresponding value of the
logistic function. All observations with a value > 0.5 (located
to the right of the dotted line in the figure) are classified as 1.
(R)OC (receiver operating characteristic) curve
All points to the right of the dotted line are classified as
positive, all points to its left are classified as negative.
o's to the left = TN    o's to the right = FP
x's to the right = TP   x's to the left = FN
(R)OC (receiver operating characteristic) curve
If we shift the logistic function to the extreme right the dotted line will shift with it and the classifier will always output 0. If we then start sweeping the curve leftwards, the likeliest observations will start to be classified as positive, so the OC curve will grow vertically whenever a positive (x) crosses the dotted line and horizontally whenever a negative (o) does.
(R)OC (receiver operating characteristic) curve
Reasoning with distributions:
(R)OC (receiver operating characteristic) curve
The area under the curve (AUC) thus constructed is a quality
index for the classifier: the closer it is to one, the
better.
In fact, an ideal classifier separates positives from negatives
perfectly.
In that case, the OC curve will actually be a square, since it
will grow first only vertically until it reaches (FP rate, TP
rate) = (0,1), and then only horizontally to reach
(1,1).
Homework: write a simple program (very easy in Matlab)
to compute the AUC for a classifier.
Kappa statistic
Given two confusion matrices for a 3-class problem:
actual predictor (left) vs. random predictor (right)
Number of successes: sum of the diagonal entries (D)
● Kappa statistic:
κ = (Dobserved − Drandom) / (Dperfect − Drandom)
It measures the relative improvement over a random predictor.
Most probably, a similar value would be obtained if the AUC were used instead of D.
Quality criteria for binary classifiers (used by WEKA)
Random Predictor
P(a|x) = P(a)
P(b|x) = P(b)
P(c|x) = P(c)
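A small sketch of the computation in plain Java (the confusion matrix is illustrative; Drandom is obtained here from the row and column marginals, i.e., from the random predictor defined above):

    public class KappaStatistic {
        public static void main(String[] args) {
            // Illustrative 3-class confusion matrix (rows: class, cols: decision)
            int[][] observed = { {88, 10, 2}, {14, 40, 6}, {18, 10, 12} };

            int total = 0;
            for (int[] row : observed) for (int v : row) total += v;

            // Successes of the actual predictor: sum of the diagonal
            double dObserved = 0;
            for (int i = 0; i < observed.length; i++) dObserved += observed[i][i];

            // Expected successes of a random predictor with the same marginals:
            // sum over i of rowTotal_i * colTotal_i / total
            double dRandom = 0;
            for (int i = 0; i < observed.length; i++) {
                double rowSum = 0, colSum = 0;
                for (int j = 0; j < observed.length; j++) {
                    rowSum += observed[i][j];
                    colSum += observed[j][i];
                }
                dRandom += rowSum * colSum / total;
            }

            double dPerfect = total; // an ideal classifier gets everything right
            double kappa = (dObserved - dRandom) / (dPerfect - dRandom);
            System.out.printf("Kappa = %.3f%n", kappa);
        }
    }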
Difference: error measures
Actual target values: a1 a2 …an
Predicted target values: p1 p2 … pn
Mean Squared error:
MSE = ((p1 − a1)² + … + (pn − an)²) / n
Root Mean Squared (RMS) error:
RMS = √( ((p1 − a1)² + … + (pn − an)²) / n )
Mostly relevant for regression problems (continuous target
values)
Quality criteria for binary classifiers (used by WEKA)
Difference: error measures
Actual target values: a1 a2 …an
Predicted target values: p1 p2 … pn
Mean Absolute error:
MAE = (|p1 − a1| + … + |pn − an|) / n
Less sensitive to outliers.
Quality criteria for binary classifiers (used by WEKA)
Relative errors
Relative Absolute error:
RAE = (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
where ā = average of a1 … an
Relative Squared error:
RSE = ((p1 − a1)² + … + (pn − an)²) / ((ā − a1)² + … + (ā − an)²)
Quality criteria for binary classifiers (used by WEKA)
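These measures follow directly from the formulas above; a compact plain-Java sketch (the target and predicted values are illustrative):

    public class ErrorMeasures {
        public static void main(String[] args) {
            // Illustrative actual targets a_i and predictions p_i
            double[] a = {1.0, 2.0, 3.0, 4.0};
            double[] p = {1.1, 1.8, 3.3, 3.9};
            int n = a.length;

            double mean = 0;
            for (double ai : a) mean += ai / n;          // a-bar

            double se = 0, ae = 0, seRef = 0, aeRef = 0;
            for (int i = 0; i < n; i++) {
                se    += (p[i] - a[i]) * (p[i] - a[i]);  // squared errors
                ae    += Math.abs(p[i] - a[i]);          // absolute errors
                seRef += (mean - a[i]) * (mean - a[i]);  // reference: always predict a-bar
                aeRef += Math.abs(mean - a[i]);
            }

            System.out.println("MSE = " + se / n);
            System.out.println("RMS = " + Math.sqrt(se / n));
            System.out.println("MAE = " + ae / n);
            System.out.println("Relative absolute error = " + ae / aeRef);
            System.out.println("Relative squared error  = " + se / seRef);
        }
    }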
First two units of the online course (6+6
modules) on Weka: they include a revision of
some concepts introduced in the previous
lesson.
Slides associated with chapter 5 of the book
"Data Mining" by Witten, Frank, and Hall.
Available online on our course site.
Suggested readings/videos