The phases into which a classifier's design can be divided
are reflected in the structure of WEKA's Explorer:
◦ Data pre-processing (filtering) and representation
◦ Supervised Learning/Classification, or
◦ Unsupervised Learning / Clustering, or
◦ Rule-based classification
◦ Visualization
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ )
Introduction to Weka
Online course on WEKA (http://weka.waikato.ac.nz)
5 lessons of 6 modules each (7-10 minutes +
exercises): about 15 hours of total engagement
including time for assignments.
Lessons are also accessible on YouTube
Weka
WEKA lets one choose among a huge number of
modules which implement the main steps necessary
to develop a classification/data mining application
(filters, classifiers, clustering algorithms, graphical
analysis of results)
Modules can be accessed:
◦ using a command-line interface
◦ through one of the GUIs provided by Weka
◦ within one's own code, as functions of a Java
package (see the sketch below)
Weka
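For instance, a minimal sketch of the last option, assuming weka.jar is on the classpath (the file name and the particular filter/classifier modules chosen here are illustrative, not prescribed by these slides):

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;

    public class WekaModules {
        public static void main(String[] args) throws Exception {
            // Load a data set from an ARFF file (file name is illustrative)
            Instances raw = DataSource.read("mydata.arff");
            // Tell Weka which attribute is the class label (here: the last one)
            raw.setClassIndex(raw.numAttributes() - 1);

            // Pre-processing step: apply one of Weka's filter modules
            Normalize norm = new Normalize();
            norm.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, norm);

            // Classification step: train one of Weka's classifier modules
            Classifier model = new J48();   // a decision-tree classifier
            model.buildClassifier(data);
            System.out.println(model);      // prints the learned model
        }
    }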
WEKA includes three GUIs:
◦ Explorer
Interactive pre-processing, data classification and algorithm configuration
◦ Experimenter
Environment for organizing multiple executions of algorithms (on different data sets or using different random seeds), with final generation of a summary of the results.
◦ Knowledge Flow
GUI for 'visual' building of an application as a cascade of modules.
WEKA also provides a Command-Line Interface from which it is possible to run the different modules as independent applications.
Introduction to Weka
Explorer
◦ Main functions and options
An example
◦ Classification of 2-dimensional data (easy to
visualize!)
Bayes Classifiers
◦ Naïve Bayes
Though it is based on the hypothesis that the features are
independent, which is very rarely true, it often performs
reasonably well and is very often taken as a "baseline
reference" for more sophisticated classifiers
An example
Classify the dataset saved in the files
circletrain.arff // training set
circletest.arff // test set
This is a binary problem: one needs to distinguish between
patterns belonging to 2 classes.
Problem
distinguish the points belonging to:
1. the circle of radius 1, centered at (0,0)
2. the square with sides of length L=2.5,
centered at (0,0), from which region
1 has been removed.
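A possible sketch of this experiment run through Weka's Java API instead of the Explorer (NaiveBayes and Evaluation are standard Weka classes; it is assumed here that the class label is the last attribute in the ARFF files):

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CircleExample {
        public static void main(String[] args) throws Exception {
            // Load training and test sets from the ARFF files
            Instances train = DataSource.read("circletrain.arff");
            Instances test  = DataSource.read("circletest.arff");
            // The class label is assumed to be the last attribute
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Train a naive Bayes classifier on the training set
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);

            // Evaluate it on the independent test set
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);
            System.out.println(eval.toSummaryString()); // accuracy, error rates, ...
            System.out.println(eval.toMatrixString());  // confusion matrix
        }
    }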
Recalling what we said in the previous lesson, the
classical learning scheme based on
training/(validation)/test sets may still provide very
inaccurate estimates of a classifier's performance,
for example because of:
• Too few data
• Data unevenly distributed among classes and
between the training and test sets, with some classes
over- or under-represented
• Presence of outliers
N-fold Crossvalidation
In the circle/square example the training and test sets
have been sampled under the best possible conditions:
• Samples drawn randomly according to a uniform
distribution for both x and y coordinates => x and y
are statistically independent (that is why the naïve
Bayes classifier works reasonably well)
• No noise added to features (x and y)
• No wrong/noisy expected output/label (aka teaching
input)
N-fold Crossvalidation
Still we have
• 53/47 (circle/square) training patterns … and 60/40
(circle/square) testing patterns, versus an expected
ratio of π/(6.25−π) = 3.14/3.11 ≈ 1.01
It is no surprise we obtain different results on the
training and test sets …
And that different random drawings of the training and
test sets also lead to rather different estimates of the
classifier’s performance.
N-fold Crossvalidation
What improvements can we introduce ?
• Stratification: the drawing is made such that the
distribution of samples of different classes is the
same in each set
This solves one problem… but cannot do much when
data are few and different drawings can apparently tell
different stories. In particular, the smaller the sample
size, the stronger the impact a single outlier has on the
estimates.
N-fold Crossvalidation
After all:
• We have just a small data set
• If we use all data for training, we will definitely
overfit the data.
So what can we do if we want to take the most from
our data ?
N-fold Crossvalidation
• Divide the data set into N subsets
• For N times:
• Leave one subset out and use it as the test set,
training a classifier on the other N−1 subsets
• Estimate the performance of the classifier over the
union of the results obtained on the N disjoint test
sets.
Extreme case: ‘leave-one-out’, in which one has as many
folds as samples.
N-fold Crossvalidation
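Weka implements this procedure directly; a minimal sketch using its Java API (the choice of N = 10, the classifier, and the file name are illustrative):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidation {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("circletrain.arff"); // illustrative
            data.setClassIndex(data.numAttributes() - 1);

            // N-fold cross-validation (here N = 10): Weka trains N classifiers
            // internally, tests each on its held-out fold, and pools the results
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }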
Observations
• Each sample is used for testing by a classifier that
was not trained on it
• The procedure can be used only to evaluate the
method (estimate a classifier's error, compare it to
others) but not to design a classifier: we have
produced N different classifiers, none of which has
any reason to be preferred to the others.
N-fold Crossvalidation
The decision taken by a binary classifier produces one of
four possible outcomes:
True Positive (TP): decision = 1, true class C = 1
True Negative (TN): decision = 0, true class C = 0
False Positive (FP): decision = 1, true class C = 0
False Negative (FN): decision = 0, true class C = 1
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
Given a data set with Nd patterns to which one applies a
binary classifier, then TP+TN+FP+FN = Nd
Accuracy: 100 · (TP+TN)/(TP+TN+FP+FN) = 100 · (TP+TN)/Nd
Estimate of P(correct decision). Correct decision rate.
Recall / Sensitivity / TP rate: 100 · TP/(TP+FN)
Estimate of P(1|c=1). Correct decision rate when c = 1.
Specificity = 1 − FP rate: 100 · TN/(TN+FP)
Estimate of P(0|c=0). Correct decision rate when c = 0.
Precision / Positive predictivity: 100 · TP/(TP+FP)
Estimate of P(c=1|1). Rate of correct decisions when the decision is 1.
F-Measure = 2 · Precision · Recall / (Precision + Recall)
Global performance index; with Precision and Recall expressed as fractions, F-Measure ∈ [0,1].
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
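As a small worked sketch of these formulas (plain Java; the four counts are illustrative):

    public class BinaryQuality {
        public static void main(String[] args) {
            // Illustrative outcome counts for some binary classifier
            double TP = 90, FN = 10, FP = 20, TN = 80;
            double Nd = TP + TN + FP + FN;

            double accuracy    = 100 * (TP + TN) / Nd;  // correct decision rate
            double recall      = 100 * TP / (TP + FN);  // sensitivity / TP rate
            double specificity = 100 * TN / (TN + FP);  // 1 - FP rate
            double precision   = 100 * TP / (TP + FP);  // positive predictivity
            // F-measure uses fractional precision/recall, so it lies in [0, 1]
            double fMeasure = 2 * (precision / 100) * (recall / 100)
                            / (precision / 100 + recall / 100);

            System.out.printf("Acc=%.1f%% Rec=%.1f%% Spe=%.1f%% Pre=%.1f%% F=%.3f%n",
                              accuracy, recall, specificity, precision, fMeasure);
        }
    }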
The Confusion Matrix (CM) associated with an N-class classifier is a square N×N matrix whose element Aij represents the number (or the frequency, if normalized by the number of samples of class i) of patterns belonging to class i that are classified as belonging to class j.
Properties:
• The CM associated with an ideal classifier is diagonal
• The sum of row i, Σj Aij, equals the number of patterns belonging to class i (or 1, if frequencies are used)
• Using the elements Aij it is possible to compute the quality indices for a binary classifier whose decision is ' i / not i ':
• TPi = Aii (equal to the Recall for class i, if frequencies are used)
• FNi = Σj Aij (1 − δij), where δij is the Kronecker delta (Homework: derive TNi and FPi)
Quality criteria for binary classifiers (see also Prof. Bononi’s notes on Bayesian Prediction)
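A plain-Java sketch of the ' i / not i ' reduction, applied to the 4-class confusion matrix shown below (the TNi and FPi lines are one possible solution to the homework, derived from the row, column, and grand totals):

    public class PerClassCounts {
        public static void main(String[] args) {
            // The 4-class confusion matrix below (rows: class, cols: decision)
            int[][] A = { {190, 6, 2, 2}, {16, 356, 8, 20},
                          {1, 1, 48, 0},  {4, 4, 1, 91} };
            int N = A.length, total = 0;
            for (int[] row : A) for (int v : row) total += v;

            for (int i = 0; i < N; i++) {
                int rowSum = 0, colSum = 0;
                for (int j = 0; j < N; j++) { rowSum += A[i][j]; colSum += A[j][i]; }
                int TP = A[i][i];                      // TPi = Aii
                int FN = rowSum - TP;                  // FNi = sum of row i, minus Aii
                int FP = colSum - TP;                  // FPi = sum of column i, minus Aii
                int TN = total - rowSum - colSum + TP; // TNi = everything else
                System.out.printf("class %d: TP=%d FN=%d FP=%d TN=%d%n",
                                  i + 1, TP, FN, FP, TN);
            }
        }
    }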
Confusion Matrix (rows: actual class, columns: decision)

        C1    C2    C3    C4  |  Ni
C1     190     6     2     2  |  200
C2      16   356     8    20  |  400
C3       1     1    48     0  |   50
C4       4     4     1    91  |  100
Confusion Matrix (%) (each row normalized by Ni)

        C1    C2    C3    C4
C1     95%    3%    1%    1%
C2      4%   89%    2%    5%
C3      2%    2%   96%    0%
C4      4%    4%    1%   91%
Confusion Matrix (Binary Classifier) (rows: actual class, columns: decision)

        1    0
1      TP   FN
0      FP   TN

Normalizing row-wise by Np = TP + FN and Nn = FP + TN,
respectively, one obtains the TP (FN, FP, TN) rates.
The general goal for a classifier is to maximize accuracy.
However, that single figure does not tell the whole story.
Only if accuracy ≈ 1 can we be sure that the classifier is
actually good.
As an example, consider a classifier that must recognize the
digit '8' (positive) against all other digits (negatives).
If we have a uniform sample of digits, the priors are
P(n) = 0.9   P(p) = 0.1
So a "classifier" which always decides 0 (negative) has an
accuracy of 90%, but it is definitely not a good classifier!
ROC (receiver operating characteristic) curve
In terms of optimization, designing a classifier is a multi-
objective optimization problem, in which two criteria must be
maximized: Sensitivity (Recall) and Specificity, i.e.,
Maximize TP (Minimize FN)
Minimize FP (Maximize TN)
ROC (receiver operating characteristic) curve
ROC curves are used in signal detection to show the tradeoff
between hit rate (TP rate) and false alarm rate (FP rate) over
noisy channels.
They were defined when radar theory was first developed.
The y axis represents the percentage of true positives in the sample;
the x axis represents the percentage of false positives in the sample.
It is an estimate of the actual ROC curve :
y axis represents P(ŷ=1 |y=1) (Sensitivity/ TP rate)
x axis represents P(ŷ=1 |y=0) (FP rate)
its slope is equal to the likelihood ratio used in the Bayes
decision rule (see Prof. Bononi’s slides, page B13)
ROC (receiver operating characteristic) curve
(R)OC (receiver operating characteristic) curve
What we obtain, formally, is not the ROC but the so-called
OC (operating characteristic) curve.
In fact, there are cases (see Duda-Hart, page …), e.g., with
multi-modal distributions, in which using a single threshold
is not the right way to go for receiver/classifier design…
Still, a curve (which may not even be convex) can always be
built according to the same rules.
Imagine the following processing chain is used:

signal → Preprocessing → V → Thresholding → 0/1

The curve is obtained by plotting the points
(FP rate, TP rate)
one would obtain by sweeping the threshold
from +∞ (ŷ = 0 for all patterns, FP rate = TP rate = 0)
to −∞ (ŷ = 1 for all patterns, FP rate = TP rate = 1)
(R)OC (receiver operating characteristic) curve
(R)OC (receiver operating characteristic) curve
How can we estimate the curve if we cannot "set" the
threshold and we have no explicit model of the classifier's
behaviour?
We can use the results obtained on the test set.
Supposing that the classifier (as most do) outputs, for each
pattern, a value V that is closer to 1 the more likely the
pattern is to be a positive one, one can:
1. Sort {Vi , i = 1, …, Nd} in descending order
2. For i = 1, …, Nd:
- move up 1/Np if yi = 1
- move 1/Nn to the right if yi = 0
(Np and Nn being the numbers of positive and negative
patterns, as defined above).
(R)OC (receiver operating characteristic) curve
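A plain-Java sketch of this construction (the scores and labels are illustrative; Np and Nn denote the numbers of positive and negative patterns, as before):

    import java.util.Arrays;
    import java.util.Comparator;

    public class RocCurve {
        public static void main(String[] args) {
            // Illustrative classifier outputs V and true labels y (1 = positive)
            double[] v = {0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2};
            int[]    y = {1,   1,   0,   1,   0,    1,   0,   0  };

            // 1. Sort pattern indices by descending score V
            Integer[] idx = new Integer[v.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> v[i]).reversed());

            int np = 0, nn = 0;
            for (int label : y) { if (label == 1) np++; else nn++; }

            // 2. Sweep the threshold: up 1/Np on a positive, right 1/Nn on a negative
            double tpRate = 0, fpRate = 0;
            System.out.printf("%.2f %.2f%n", fpRate, tpRate);      // start at (0,0)
            for (int i : idx) {
                if (y[i] == 1) tpRate += 1.0 / np;
                else           fpRate += 1.0 / nn;
                System.out.printf("%.2f %.2f%n", fpRate, tpRate);  // one curve point per pattern
            }
            // The AUC can be accumulated during the same sweep
            // (left as the homework suggested below).
        }
    }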
If we use data from a single test set, we will obtain a jagged
curve; if we use cross-validation the curve should be
smoother.
(R)OC (receiver operating characteristic) curve
Using (for instance) logistic regression, the likelihood
of each observation is the corresponding value of the
logistic function. All observations with a value > 0.5 (located
to the right of the dotted line in the figure) are classified as 1.
(R)OC (receiver operating characteristic) curve
All points to the right of the dotted line are classified as
positive, all points to its left are classified as negative.
o's to the left = TN    o's to the right = FP
x's to the right = TP   x's to the left = FN
(R)OC (receiver operating characteristic) curve
If we shift the logistic function to the extreme right the dotted line will shift with it and the classifier will always output 0. If we then start sweeping the curve leftwards, the likeliest observations will start to be classified as positive, so the OC curve will grow vertically whenever a positive (x) crosses the dotted line and horizontally whenever a negative (o) does.
(R)OC (receiver operating characteristic) curve
Reasoning with distributions:
(R)OC (receiver operating characteristic) curve
The area under the curve (AUC) thus constructed is a quality
index for the classifier: the closer it is to one, the
better.
In fact, an ideal classifier separates positives from negatives
perfectly.
In that case, the OC curve will actually be a square, since it
will grow first only vertically until it reaches (FP rate, TP
rate) = (0,1), and then only horizontally to reach
(1,1).
Homework: write a simple program (very easy in Matlab)
to compute the AUC for a classifier.
Kappa statistic
Given two confusion matrices for a 3-class problem:
actual predictor (left) vs. random predictor (right)
Number of successes: sum of the diagonal entries (D)
● Kappa statistic:
κ = (Dobserved − Drandom) / (Dperfect − Drandom)
It measures the relative improvement over a random predictor.
Most probably, a similar value would be obtained if the AUC were used instead of D.
Quality criteria for binary classifiers (used by WEKA)
Random Predictor
P(a|x) = P(a)
P(b|x) = P(b)
P(c|x) = P(c)
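A small sketch of the computation in plain Java (the confusion matrix is illustrative; Drandom is obtained here from the row and column marginals, i.e., from the random predictor defined above):

    public class KappaStatistic {
        public static void main(String[] args) {
            // Illustrative 3-class confusion matrix (rows: class, cols: decision)
            int[][] observed = { {88, 10, 2}, {14, 40, 6}, {18, 10, 12} };

            int total = 0;
            for (int[] row : observed) for (int v : row) total += v;

            // Successes of the actual predictor: sum of the diagonal
            double dObserved = 0;
            for (int i = 0; i < observed.length; i++) dObserved += observed[i][i];

            // Expected successes of a random predictor with the same marginals:
            // sum over i of rowTotal_i * colTotal_i / total
            double dRandom = 0;
            for (int i = 0; i < observed.length; i++) {
                double rowSum = 0, colSum = 0;
                for (int j = 0; j < observed.length; j++) {
                    rowSum += observed[i][j];
                    colSum += observed[j][i];
                }
                dRandom += rowSum * colSum / total;
            }

            double dPerfect = total; // an ideal classifier gets everything right
            double kappa = (dObserved - dRandom) / (dPerfect - dRandom);
            System.out.printf("Kappa = %.3f%n", kappa);
        }
    }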
Difference: error measures
Actual target values: a1 a2 …an
Predicted target values: p1 p2 … pn
Mean Squared error:
MSE = ((p1 − a1)² + … + (pn − an)²) / n
Root Mean Squared (RMS) error:
RMS = √( ((p1 − a1)² + … + (pn − an)²) / n )
Mostly relevant for regression problems (continuous target
values)
Quality criteria for binary classifiers (used by WEKA)
Difference: error measures
Actual target values: a1 a2 …an
Predicted target values: p1 p2 … pn
Mean Absolute error:
MAE = (|p1 − a1| + … + |pn − an|) / n
Less sensitive to outliers.
Quality criteria for binary classifiers (used by WEKA)
Relative errors
Relative Absolute error:
RAE = (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
where ā = average of a1 … an
Relative Squared error:
RSE = ((p1 − a1)² + … + (pn − an)²) / ((ā − a1)² + … + (ā − an)²)
Quality criteria for binary classifiers (used by WEKA)
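These measures follow directly from the formulas above; a compact plain-Java sketch (the target and predicted values are illustrative):

    public class ErrorMeasures {
        public static void main(String[] args) {
            // Illustrative actual targets a_i and predictions p_i
            double[] a = {1.0, 2.0, 3.0, 4.0};
            double[] p = {1.1, 1.8, 3.3, 3.9};
            int n = a.length;

            double mean = 0;
            for (double ai : a) mean += ai / n;          // a-bar

            double se = 0, ae = 0, seRef = 0, aeRef = 0;
            for (int i = 0; i < n; i++) {
                se    += (p[i] - a[i]) * (p[i] - a[i]);  // squared errors
                ae    += Math.abs(p[i] - a[i]);          // absolute errors
                seRef += (mean - a[i]) * (mean - a[i]);  // reference: always predict a-bar
                aeRef += Math.abs(mean - a[i]);
            }

            System.out.println("MSE = " + se / n);
            System.out.println("RMS = " + Math.sqrt(se / n));
            System.out.println("MAE = " + ae / n);
            System.out.println("Relative absolute error = " + ae / aeRef);
            System.out.println("Relative squared error  = " + se / seRef);
        }
    }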
First two units of the online course (6+6
modules) on Weka: they include a revision of
some concepts introduced in the previous
lesson.
Slides associated with chapter 5 of the book
"Data Mining" by Witten, Frank, and Hall.
Available online on our course site.
Suggested readings/videos