
Exercises with PRTools

ASCI APR Course, May 2008

R.P.W. Duin, M. Loog, D. de Ridder, D.M.J. Tax, C. Veenman

Information and Communication Theory Group, Delft University of Technology
Informatics Institute, Faculty of Science, University of Amsterdam

[Cover figure: scatterplots of a three-class example dataset with the classes apple, banana and pear]


Introduction

The aim of this set of exercises is to assist the reader in getting acquainted with PRTools, a Matlab toolbox for pattern recognition. It is a prerequisite to have a global knowledge of pattern recognition, to have read the introductory part of the PRTools manual and to have access to this manual during the study of the exercises. Moreover, the reader needs to have some experience with Matlab and should regularly study the help texts provided with the PRTools commands (e.g. help gendatc).

The exercises should give some insight into the toolbox. They are not meant to explain in detail how the tools are constructed and, thereby, they do not reach the level that enables the student to add new tools to PRTools, using its specific classes dataset and mapping.

It is left to the responsibility of the reader to study the exercises using various datasets. They can either be generated by one of the routines in the toolbox or be loaded from a special dataset directory. In section 13 this is explained further, with examples of both artificial and real-world data. First the Matlab commands are given, next scatterplots of some of the sets are shown. Note that not all the arguments in the commands presented are compulsory. It is necessary to refer to these pages regularly in order to find suitable problems for the exercises.

In order to build pattern recognition systems for real world (raw) datasets, e.g. images as they are grabbed by a camera, preprocessing and the measurement of features are necessary. The growing measurement toolbox MeasTools is designed for that. Here it is unavoidable that students write their own low-level routines, as at this moment the collection of feature measuring tools is insufficient. As no MeasTools manual is available yet, students should read the online documentation and the additional material which may be supplied during a course.

Don't forget to study the exercises presented in the manual and the examples available under PRTools (e.g. prex_cleval)!

**************************

The exercises assume that the data collections prdatasets, prdatafiles and Coursedata are available. The last directory also contains some experimental commands not available in the standard PRTools distribution.

Version 4.1 of PRTools contains some new facilities that may confuse the user. The prprogress command controls the reporting of long-running commands. It may irritate the user and may even sometimes crash (especially in the Java interface). It may be switched off by prprogress off.

Some commands may generate many warnings, especially when no class priors are set in a dataset. One solution is to switch off the PRTools warning mechanism by prwarning(0). A better way is to set class priors. E.g. a = setprior(a,getprior(a)) sets the class priors according to the class frequencies if they are not yet defined.


Contents

1 Introduction
2 Classifiers
3 Neural network classifiers
4 Classifier evaluation and error estimation
5 Cluster Analysis and Image Segmentation
6 Dissimilarity Based Representations
7 Feature Spaces and Feature Reduction
8 Complexity and Support Vector Classifiers
9 One-Class Classifiers
10 Classifier combining
11 Boosting
12 Image Segmentation and Classification
13 Summary of the methods for data generation and available data sets


1 Introduction

Example 1. Datasets

PRTools entirely deals with sets of objects represented by vectors in a feature space. The central data structure is a so-called dataset. It consists of a matrix of size m × k: m row vectors representing the objects, given by k features each. Attached to this matrix is a set of m labels (strings or numbers), one for each object, and a set of k feature names (also strings or numbers), one for each feature. Moreover, a set of prior probabilities, one for each class, is stored. Objects with the same label belong to the same class. In most help files in PRTools, a dataset is denoted by A. Almost all routines can handle multi-class objects. Some useful routines to handle datasets are:

dataset Define dataset from data matrix and labels

gendat Generate a random subset of a dataset

genlab Generate dataset labels

seldat Select a specific subset of a dataset

setdat Define a new dataset from an old one by replacing its data

getdata Retrieve data from dataset

getlab Retrieve object labels

getfeat Retrieve feature labels

renumlab Convert labels to numbers

Sets of objects may be given externally or may be generated by one of the data generation routines in PRTools (see section 13). Their labels may be given externally or may be the result of a classification or a cluster analysis. A dataset containing 10 objects with 5 random measurements can be generated by:

>> data = rand(10,5);

>> a = dataset(data)

10 by 5 dataset with 0 classes: [ ]

In this example no labels are supplied, therefore no classes are detected. Labels can be added to the dataset by:

>> labs = [1 1 1 1 1 2 2 2 2 2]'; % labs should be a column vector

>> a = dataset(a,labs)

10 by 5 dataset with 2 classes: [5 5]

Note that the labels have to be supplied as a column vector. A simple way to assign labels to a dataset is offered by the routine genlab in combination with the Matlab char command:

>> labs = genlab([4 2 4],char('apple','pear','banana'))

>> a = dataset(a,labs)

10 by 5 dataset with 3 classes: [4 4 2]

Note that the order of the classes has changed. Use the routines getlab and getfeat to retrieve the object labels and the feature labels of a. The fields of a dataset can be made visible by converting it to a structure, e.g.:

>> struct(a)

data: [10x5 double]

lablist: [3x6 char]

nlab: [10x1 double]

labtype: 'crisp'

targets: []

featlab: [5x1 double]

featdom: {[] [] [] [] []}

prior: []

cost: []

objsize: 10

featsize: 5

ident: {10x1 cell}

version: {[1x1 struct] '05-Apr-2005 18:57:19'}

name: []

user: []

In the on-line information on datasets (help datasets, also printed in the PRTools manual) the meaning of these fields is explained. Each field may be changed by a set-command, e.g.

>> b = setdata(a,rand(10,5));

Field values can be retrieved by a similar get-command, e.g.

>> classnames = getlablist(a)

In nlab an index is stored for each object into the list of class names lablist. Note that this list is alphabetically ordered. The size of a dataset can be found by both size and getsize:

>> [m,k] = size(a);

>> [m,k,c] = getsize(a);

The number of objects is returned in m, the number of features in k and the number of classes in c. The class prior probabilities are stored in prior. By default it is set to the class frequencies if the field is empty. Data in a dataset can also be retrieved by double(a) or, more simply, by +a.

1.1 Have a look at the help information of seldat. Notice that it has many input parameters. In most cases you can ignore input parameters of functions that are of no interest to you. The default values are often good enough. Use the routine to extract the banana class from a and check this by inspecting the result of +a.
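
A minimal sketch (banana is the second class in the alphabetically ordered label list; assuming seldat accepts the class number as its second argument):

>> b = seldat(a,2);   % extract the 'banana' class
>> +b                 % inspect the raw data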

Datasets can be manipulated in many ways, comparable with Matlab matrices. So [a1; a2] combines two datasets, provided that they have the same number of features. The feature set may be extended by [a1 a2] if a1 and a2 have the same number of objects.

1.2 Generate 3 new objects of the classes 'apple' and 'pear' and add them to the dataset a. Check if the class sizes change accordingly.
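
One possible way, concatenating a freshly labeled dataset as described above:

>> b = dataset(rand(6,5),genlab([3 3],char('apple','pear')));
>> a = [a; b]   % the class sizes should now read [7 4 5]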

1.3 Add a new, 6th feature to the whole dataset a.
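
One way, assuming horizontal concatenation with a single-feature dataset:

>> a = [a dataset(rand(size(a,1),1))];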


Another way to inspect a dataset is to make a scatterplot of the objects in the dataset. For this the function scatterd is supplied. This plots each object in a dataset in a 2D graph, using a coloured marker when class labels are supplied. When more than two features are present in the dataset, only the first two are used. For obtaining a scatterplot of two other features they have to be explicitly extracted first, e.g. a1 = a(:,[2 5]);. With the extra option 'legend' one can add a legend to the figure, showing which markers indicate which classes.

1.4 Use scatterd to make a scatterplot of the features 2 and 5 of dataset a. Try it also using the 'legend' option.
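
For instance:

>> scatterd(a(:,[2 5]),'legend')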

1.5 Next, use scatterdui to make a scatterplot of a and use its buttons to select features. (Note that 'legend' is not a valid option here.)

1.6 It is also possible to create 3D scatterplots. Make a 3-dimensional scatterplot byscatterd(a,3) and try to rotate it by the mouse after pressing the right toolbar button.

1.7 Use one of the procedures described in section 13 to create an artificial dataset of 100 objects. Make a scatterplot. Repeat this a few times.

Exercise 1. Scatterplot

Load the 4-dimensional Iris dataset by a = iris and make scatterplots of all feature combinations using the gridded option of scatterd. Try also all feature combinations using scatterdui.

Plot in a separate figure the one-dimensional feature densities by plotf. Identify visually the best combination of two features. Create a new dataset b that contains just these two features. Create a new figure by the figure command and plot a scatterplot of b.
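
A possible sequence (assuming 'gridded' is passed to scatterd as a string option; the chosen feature pair is only an example):

>> a = iris;
>> scatterd(a,'gridded')
>> figure; plotf(a)
>> b = a(:,[3 4]);   % suppose features 3 and 4 look best
>> figure; scatterd(b)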

Exercise 2. Mahalanobis distance (optional)

Use the distmaha command to compute the Mahalanobis distances between all pairs of classes in the iris dataset. Repeat this for the best two features just selected. Can you find a way to test whether this is really the best feature pair according to the Mahalanobis distance?

Exercise 3. Generate your own dataset (optional)

Generate a dataset that consists of two 2-D uniformly distributed classes of objects using the rand command. Transform the sets such that for the [xmin xmax; ymin ymax] intervals the following holds: [0 2; -1 1] for class 1 and [1 3; 1.5 3.5] for class 2. Generate 50 objects for each class. An easy way is to do this for the x and y coordinates separately and combine them afterwards. Label the features by 'area' and 'perimeter'.

Check the result by scatterd and by retrieving object labels and feature labels.
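
A possible construction (scaling rand outputs into the required intervals; setfeatlab is one of the dataset set-commands):

>> x1 = rand(50,1)*2;     y1 = rand(50,1)*2 - 1;    % class 1: [0 2] x [-1 1]
>> x2 = rand(50,1)*2 + 1; y2 = rand(50,1)*2 + 1.5;  % class 2: [1 3] x [1.5 3.5]
>> a = dataset([x1 y1; x2 y2],genlab([50 50]));
>> a = setfeatlab(a,char('area','perimeter'));
>> scatterd(a,'legend')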

Exercise 4. Enlarge an existing dataset (optional)

Generate a dataset using gendatb containing 10 objects per class. Enlarge this dataset to 100 objects per class by generating more data using the gendatk and gendatp commands. Compare the scatterplots with a scatterplot of 100 objects per class directly generated by gendatb. Explain the difference.

Example 2. Density estimation


The following routines are available for density estimation:

gaussm Normal distribution

parzenm Parzen density estimation

knnm K-nearest neighbour density estimation

They are programmed as mappings. Details of mappings are discussed later. Two steps are always essential for a mapping. First, the estimator is built, or trained, using a training set, e.g. by:

>> a = gauss(100)

Gaussian Data, 100 by 1 dataset with 1 classes: [100]

This is a 1-dimensional normally distributed dataset of 100 points with mean 0.

>> w = gaussm(a)

Mixture of Gaussians, 1 to 1 trained mapping --> normal map

The trained mapping w now contains all information needed for computing the densities of given points, e.g.

>> b = [-2:0.1:2]';

Now we will measure for the points defined by b the density according to w (which is a density estimator based on the dataset a):

>> d = map(b,w)
41 by 1 dataset with 0 classes: [41]

The result may be listed on the screen by [+b +d] (coordinates and densities) or plotted by:

>> plot(+b,+d)

2.1 Plot the densities estimated by parzenm and knnm in separate figures. These routines need sensible parameters. Try a few values for the smoothing parameter and the number of nearest neighbours.
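
For example (the smoothing parameter and the number of neighbours are arbitrary picks):

>> w = parzenm(a,0.5);    % Parzen estimate, smoothing parameter 0.5
>> figure; plot(+b,+map(b,w))
>> w = knnm(a,5);         % density estimate from the 5 nearest neighbours
>> figure; plot(+b,+map(b,w))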

Example 3. Create a dataset from a set of images

Load an image dataset, e.g. kimia. Use the struct command to inspect its featsize field. As this dataset consists of object images (each object in the dataset is an image), the image sizes have to be known and are stored in this field. Use the show command to visualize this image dataset.

The immoments command generates out of a dataset with object images a set of moments as features. Compute the Hu moments and study the scatterplot by scatterdui.
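
A possible sequence (assuming immoments accepts 'hu' to select the Hu moments, as its datafile counterpart im_moments does):

>> a = kimia;
>> b = immoments(a,'hu');
>> scatterdui(b)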

Exercise 5. Compute image features

Some PRTools commands operate on images stored in datasets, see help prtools. Commands like datfilt and dataim may be used to transform object images. Think of a way to compute the area and the contour length of the blobs in the kimia dataset. Display the scatterplot.

Exercise 6. Density plots (optional)


Generate a 2-dimensional 2-class dataset by gendatb of 50 points per class. Estimate the densities by each of the methods from Example 2.

Make 2D scatterplots by scatterd in three figures. Different from the above 1-dimensional example, the ready-made density plotting routine plotm can be used for drawing iso-density lines in a scatterplot. Plot them in the three figures by using the command plotm(w). Try also 3D plots by plotm(w,3). Note that plotm always needs a scatterplot first, to find the domain where the density has to be computed.

Exercise 7. Nearest Neighbor Classification (optional)

Write your own function for nearest neighbour error estimation, e = nne(d), in which the incoming parameter d is a labeled distance matrix obtained by d = distm(b,a), where a and b are labeled datasets. The objects of the datasets a and b should be represented in the same feature space. The resulting d is again a dataset. The objects of d are represented by distances between b and a. The labels of d can be retrieved by object_lab = getlab(d), the features by feat_lab = getfeat(d).

By the definition of the nearest neighbour rule, the label of each object in the test set has to be compared with the label of its nearest neighbour in the training set. In this exercise a (b) is playing the role of a training (test) set. The number of differences between two label sets can be counted by n = nlabcmp(object_lab,feat_lab).

The nne routine thereby has the following steps (a possible implementation is sketched after the list):

1. Create a vector L with as many elements as d has objects. L(i) = j, where j is the index of the nearest neighbour of row object i. This index of the closest object can be found by [dd,j] = min(d(i,:));

2. Use nlabcmp to count the differences between the true labels of the objects corresponding to the rows given by object_lab and the labels of the nearest neighbours feat_lab(L,:).

3. Normalise and return the error.

4. If the training set a and the test set b are identical (e.g. d = distm(a,a)), nne should return 0 because each object is its own nearest neighbour. Modify your routine in such a way that it returns the 'leave-one-out' error if it is called by e = nne(d,'loo'). The leave-one-out error is the error made on a set of objects if for each object under consideration the object itself is excluded from the set at the moment it is evaluated. In this case not the smallest d(i,j) on row i has to be found (which should be on the diagonal), but the next one.
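
A possible implementation along these lines (the diagonal trick assumes a square distance matrix in the leave-one-out case):

function e = nne(d,par)
% nearest neighbour error estimate on a labeled distance matrix d = distm(b,a)
object_lab = getlab(d);            % true labels of the row objects
feat_lab = getfeat(d);             % labels of the column (training) objects
dd = +d;                           % raw distances
if nargin > 1 & strcmp(par,'loo')
  dd(1:size(dd,1)+1:end) = inf;    % remove the self-distances on the diagonal
end
[dmin,L] = min(dd,[],2);           % nearest neighbour index for every row
e = nlabcmp(object_lab,feat_lab(L,:))/size(dd,1);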

Inspect some 2D datasets by scatterd and estimate the nearest neighbour error by nne.

Running Exercise 1. NIST Digits

Several datasets of handwritten digits are available. The command nist32 loads binary images of size 32x32 as a dataset of 1024 features. Features can be extracted in several ways, e.g. immoments computes by default the coordinates of the mean.

Load a dataset A of four digits, e.g. 0, 3, 5 and 8. Create a subset B with 25 objects per class. Use the show command to visualize this dataset.


Compute out of B a new dataset C with just two features, e.g. two moments. Make a scatterplot of C.


2 Classifiers

Example 4. Mappings and Classifiers

In PRTools datasets are transformed by mappings. These are procedures that map a set of objects from one space into another. Examples are feature selection, feature rescaling, rotations of the space and classification, e.g.

>> w = cmapm(10,[2 4 7])

FeatureSelection, 10 to 3 fixed mapping --> cmapm

w is herewith defined as a mapping from a 10-dimensional space to a 3-dimensional space by selecting the features 2, 4 and 7. Its name is 'FeatureSelection' and its executing routine, when it is applied to data, is cmapm. It may be applied as follows:

>> a = gauss(100,zeros(1,10))

Gaussian Data, 100 by 10 dataset with 1 class: [100]

>> b = map(a,w)

Gaussian Data, 100 by 3 dataset with 1 class: [100]

In a mapping (we use almost everywhere the variable w for mappings) various information is stored, like the dimensionalities of input and output space, parameters that define the transformation and the routine that is used for executing the transformation. Use struct(w) to see all fields.

Often a mapping has to be trained, i.e. it has to be adapted to a training set by some estimation or training procedure to minimise some error for the training set. An example is the principal component analysis, which performs an orthogonal rotation according to the directions with the main variance in a given dataset:

>> w = pca(a,2)

Principal Component Analysis, 10 to 2 trained mapping --> affine

This just defines the mapping ('trains' it by a) for finding the first 2 principal components. The fields of a mapping can be shown by struct(w). In the PRTools manual or by 'help mappings' more information on mappings can be found. The mapping w may be applied to a or to any other 10-dimensional dataset by:

>> b = map(a,w)

Gaussian Data, 100 by 2 dataset with 1 class: [100]

Instead of the routine map, the '*' operator may also be used for applying mappings to datasets:

>> b = a*w

Gaussian Data, 100 by 2 dataset with 1 class: [100]

Note that the sizes of the variables a (100 × 10) and w (10 × 2) are such that the inner dimensionalities cancel in the computation of b, as in all Matlab matrix operations.

The '*' operator may also be used for training. a*pca is equivalent to pca(a) and a*pca([],2) is equivalent to pca(a,2). As a result, an 'untrained' mapping can be stored in a variable: w = pca([],2). It may, thereby, also be passed as an argument in a function call. The advantages of this possibility will be shown later.
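
For instance:

>> u = pca([],2);   % untrained mapping stored in a variable
>> v = a*u;         % training: equivalent to pca(a,2)
>> b = a*v;         % application: equivalent to map(a,v)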


A special case of a mapping is a classifier. It maps a dataset on distances to a discriminant function or on class posterior probability estimates. Classifiers can be used in two modes: 'untrained' and 'trained'. When applied to a dataset in the 'untrained' mode, the dataset is used for training and a classifier is generated, while in the 'trained' mode the dataset is classified. Unlike mappings, fixed classifiers don't exist. Some important classifiers are:

fisherc Fisher classifier

qdc Quadratic classifier assuming normal densities

udc Quadratic classifier assuming normal uncorrelated densities

ldc Linear classifier assuming normal densities with equal covariance matrices

nmc Nearest mean classifier

parzenc Parzen density based classifier

knnc k-nearest neighbour classifier

treec Decision tree

svc Support vector classifier

lmnc Neural network classifier trained by the Levenberg-Marquardt rule

4.1 Generate a dataset a by gendath and compute the Fisher classifier by w = fisherc(a). Make a scatterplot of a and plot the classifier by plotc(w). Classify the training set by d = map(a,w) or d = a*w. Show the result on the screen by +d.
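
The corresponding commands:

>> a = gendath;
>> w = fisherc(a);
>> scatterd(a); plotc(w)
>> d = a*w;
>> +d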

4.2 What is displayed is the value of the sigmoid function of the distances to the classifier. This function maps the distances to the classifier from the (−∞, +∞) interval onto the (0, 1) interval. The latter can be interpreted as posterior probabilities. The original distances can be retrieved by +invsigm(d). This may be visualised by plot(+invsigm(d(:,1)),+d(:,1),'*'), which shows the shape of the sigmoid function (distances along the horizontal axis, sigmoid values along the vertical axis).

4.3 During training, distance based classifiers are appropriately scaled such that the posterior probabilities are optimal for the training set in the maximum likelihood sense. In multi-class problems a normalisation is needed to take care that the posterior probabilities sum to one. This is enabled by classc. So classc(map(a,w)), or a*w*classc, maps the dataset a on the trained classifier w and normalises the resulting posterior probabilities. If we include training as well, this can be written as a one-liner: p = a*(a*fisherc)*classc. (Try to understand this expression: between the brackets the classifier is trained; the result is applied to the same dataset.) Note that because the sigmoid-based normalisation is a monotonic transformation, it does not alter the class membership of data samples in the maximum a posteriori probability (MAP) sense.

This may be visualized by computing classifier distances, sigmoids and normalized posterior probability estimates for a multi-class problem as follows. Load the 80x dataset by a = x80. Compute the Fisher classifier by w = a*fisherc, classify the training set by d = a*w, and compute p = d*classc. Display the various output values by +[d p]. Note that the object confidences over the first 3 columns don't sum to one and that they are normalised in the last 3 columns to proper posterior probability estimates.
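
In commands:

>> a = x80;
>> w = a*fisherc;
>> d = a*w;
>> p = d*classc;
>> +[d p]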

4.4 Density based classifiers like qdc find after training (w = qdc(a), or w = a*qdc) density estimators for all classes in the training set. Estimates for objects in some dataset b can be found by d = b*w. Again, posterior probability estimates are found after normalisation by classc: p = d*classc. Have a look at +[d p] to see the estimates for the class densities and the related posterior probabilities.

Example 5. Classifiers and discriminant plots.

This example illustrates how to plot decision boundaries in 2D scatter plots by plotc.

5.1 Generate a dataset, make a scatter plot, train and plot some classifiers by

>> a = gendath([20 20]);

>> scatterd(a)

>> w1 = ldc(a);

>> w2 = nmc(a);

>> w3 = qdc(a);

>> plotc({w1,w2,w3})

Plot in a new scatterplot of a a series of classifiers computed by the k-NN rule (knnc) for various values of k between 1 and 10. Look at the influence of the neighbourhood size on the classification boundary. Check the boundary for k=1.
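
For instance (the values of k are an arbitrary selection):

>> figure; scatterd(a)
>> for k=[1 3 5 10], plotc(knnc(a,k)); end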

5.2 A special option of plotc colours the regions assigned to different classes:

>> a = gendatm

>> w = a*qdc

>> scatterd(a) % defines the plotting domain of interest

>> plotc(w,'col') % colours the class regions

>> hold on % necessary to preserve the plot

>> scatterd(a) % plots the data again in the plot

Plots like these are influenced by the grid size used for computing the classifier outputs in the scatterplot. By default it is 30 × 30 (the grid size is 30). The grid size value can be retrieved and set by gridsize. Study its influence by setting the gridsize to 100 (or even larger) and repeating the above commands. Use a new figure each time, so results can be compared. Note the influence on the computation time.
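
For example (assuming gridsize(n) sets the value):

>> gridsize(100)
>> figure; scatterd(a)
>> plotc(w,'col'); hold on; scatterd(a)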

Exercise 8. Normal densities based classifiers.

Take the features 2 and 3 of the Iris dataset. Make a scatterplot and plot the normal densities in it; see also Example 2 and/or Exercise 6. Compute the quadratic classifier based on normal densities (qdc) and plot it on top of this. Repeat this for the uncorrelated (udc) and the linear (ldc) classifiers based on normal distributions, but plot them on top of the corresponding density estimation plots.
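
A possible sequence for the qdc case (plotm draws the densities of the trained classifier):

>> a = iris; b = a(:,[2 3]);
>> w = qdc(b);
>> scatterd(b); plotm(w); plotc(w)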

Exercise 9. Linear classifiers (optional)

Use the same dataset for comparing some linear classifiers: the linear normal distribution based classifier (ldc), nearest mean (nmc), Fisher (fisherc) and the support vector classifier (svc). Plot them on top of each other, in different colours, in the same scatterplot. Don't plot density estimates now.

Exercise 10. Non-linear classifiers (optional)

Generate a dataset by gendath and compare in the scatterplots the quadratic normal densities based classifier (qdc) with the Parzen classifier (parzenc) and the 1-nearest neighbour rule (knnc([],1)). Try also a decision tree (treec).


Example 6. Training and test sets (optional)

The performance of a classifier w can be tested on an independent test set, say b. If such a set is available, the routine testc may be used to count the number of errors. Note that the routine classc just converts classifier outcomes to posterior probabilities, but does not change the class assignments. So b*w*classc*testc produces the same result as b*w*testc.

6.1 Generate a training set a of 20 objects per class by gendath and a test set b of 1000 objects per class. Compute the performance of the Fisher classifier by b*(a*fisherc)*testc. Repeat this for some other classifiers. For which classifiers do the errors on the training set and test set differ most? Which classifier performs best?
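
In commands:

>> a = gendath([20 20]); b = gendath([1000 1000]);
>> b*(a*fisherc)*testc   % test error
>> a*(a*fisherc)*testc   % apparent (training set) error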

Example 7. Classifier evaluation

In PRTools a dataset a can be split into a training set b and a test set c by the gendat command, e.g. [b,c] = gendat(a,0.5). In this case, for each class 50% of the objects are randomly chosen for dataset b and the remaining objects are stored in dataset c. After computing a classifier on the training set, e.g. w = b*fisherc, the test set c can be classified by d = c*w. For each object, the label of the class with the highest confidence, or posterior probability, can be found by d*labeld. E.g.:

>> a = gendath;

>> [b,c] = gendat(a,0.9)

Higleyman Dataset, 90 by 2 dataset with 2 classes: [45 45]

Higleyman Dataset, 10 by 2 dataset with 2 classes: [5 5]

>> w = fisherc(b); % the class names (labels) of b are stored in w

>> getlabels(w) % this routine shows the class labels (here 1 and 2)

>> d = c*w; % classify test set

>> lab = d*labeld; % get the labels of the test objects

>> disp([+d lab]) % show the posterior probabilities and labels

Note that in the last displayed column (lab) the labels of the classes with the highest classifier outputs are stored. The average error on a test set can be directly computed by testc:

>> d*testc

which may also be written as testc(d) or testc(c,w) (or c*w*testc).

Exercise 11. Error limits of K-NN rule and Parzen classifier

Take a simple dataset like the Highleyman classes (gendath) and generate a small training set (e.g. 25 objects per class) and a large test set (e.g. 200 objects per class). Recall what the theory predicts for the limits of the classification error of the k-NN rule and the Parzen classifier as a function of the number of neighbours k and the smoothing parameter h. Estimate and plot the corresponding error curves and verify the theory. How can you estimate the Bayes error of the Highleyman dataset if it is known that the classes are normally distributed? Try to explain the differences between the theory and your results.
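
A possible way to estimate the curves (the ranges for k and h are arbitrary choices):

>> a = gendath([25 25]); t = gendath([200 200]);
>> for k=1:10, ek(k) = t*knnc(a,k)*testc; end
>> figure; plot(1:10,ek)
>> h = [0.1 0.25 0.5 1 2 4];
>> for i=1:6, eh(i) = t*parzenc(a,h(i))*testc; end
>> figure; plot(h,eh)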

Exercise 12. Simple classification experiment

Perform now the following experiment. A possible sequence of commands is sketched after the list.

• Load the IMOX data by a = imox. This is a feature based character recognition dataset.


• What are the class labels?

• Split the dataset in two parts, 80% for training and 20% for testing.

• Store the true labels of the test set using getlabels into lab_true

• Compute the Fisher classifier

• Classify the test set

• Store the labels found by the classifier for the test set into lab_test

• Display the true and estimated labels by disp([lab_true lab_test])

• Predict the classification error of the test set by observing the output.

• Verify this number using testc.
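
A sketch of the whole sequence:

>> a = imox;
>> getlablist(a)             % the class labels
>> [b,c] = gendat(a,0.8);    % 80% for training, 20% for testing
>> lab_true = getlabels(c);
>> w = fisherc(b);
>> d = c*w;
>> lab_test = d*labeld;
>> disp([lab_true lab_test])
>> testc(d)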

Example 8. Cell arrays with datasets and mappings

A set of datasets can be stored in a cell array:

>> A = {gendath gendatb}

The same holds for mappings and classifiers:

>> W = {nmc fisherc qdc}

As the multiplication of cell arrays cannot be overloaded, A*W cannot be used to train classifiers stored in cell arrays. However, V = map(A,W) works. Try it. Try also the gendat and testc commands for cell arrays.

Exercise 13. Classification of large datasets

Try to find out what the best classifier is for the six mfeat datasets (mfeat_fac, mfeat_fou, mfeat_kar, mfeat_mor, mfeat_pix, mfeat_zer). These are different feature sets for the same objects. Take a fixed training set of 30 objects per class and use the others for testing. Make sure that all six training sets refer to the same objects. This can be done by resetting the random seed by rand('seed',1) or by using the indexes returned by gendat.

Try the following classifiers: nmc, ldc([],1e-2,1e-2), qdc([],1e-2,1e-2), fisherc, knnc, parzenc. Write a macro script that produces a 6 × 6 table of errors (using cell arrays as discussed in Example 8, this is a 5-liner). Which classifiers perform globally well? Which dataset(s) are presumably normally distributed? Which are not?
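
One possible script, looping explicitly instead of relying on cell array multiplication (the vector of class sizes follows the usage in Example 20):

names = {'mfeat_fac','mfeat_fou','mfeat_kar','mfeat_mor','mfeat_pix','mfeat_zer'};
classifiers = {nmc,ldc([],1e-2,1e-2),qdc([],1e-2,1e-2),fisherc,knnc,parzenc};
E = zeros(6,6);
for i = 1:6
  a = feval(names{i});
  rand('seed',1);                        % same objects in every training set
  [trainset,testset] = gendat(a,repmat(30,1,10));
  for j = 1:6
    E(i,j) = testset*(trainset*classifiers{j})*testc;
  end
end
disp(E)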

Example 9. Datafiles

Datafiles are a PRTools extension of datasets, read help datafiles. They refer to raw data directories in which every file (e.g. an image) is interpreted as an object. Objects in the same sub-directory are interpreted as belonging to the same class. There are some predefined datafiles in prdatafiles, read its help file. As an example load the Flower database, define some preprocessing and inspect a subset:

>> prprogress on

>> a = flowers

>> b = a*im_resize([],[64,64,3])

>> x = gendat(b,0.05);

>> show(x)

Note that just administration is stored until real work has to be done by the show command. After feature extraction and conversion to a dataset, classifiers can be trained and tested:

>> c = b*im_gray*im_moments([],'hu')

>> [x,y] = gendat(c,0.05)

>> y = gendat(y,0.1)

>> w = dataset(x)*nmc

>> e = testc(dataset(y),w)

Also here the work starts with the dataset conversion. A number of classifiers and mappings may operate directly (without conversion) on datafiles, but this appears not to be foolproof yet. The classification result in this example is bad, as the features are bad. Look in the help file of PRTools for other mappings and feature extractors for images. You may define your own image processing operations on datafiles by filtim.

Running Exercise 2. NIST Digit classification

Load a dataset of 50 NIST digits for each of the classes 3 and 5.
Compute 2 features.
Make a scatterplot.
Compute and plot some classifiers, e.g. nmc and ldc.
Classify the dataset.
Use the routine labcmp to find the erroneously classified objects.
Display these digits using the show command. Try to understand why they are incorrectly classified given the features.


3 Neural network classifiers

In PRTools three neural network classifiers are implemented, based on an old version of Matlab's Neural Network Toolbox:

• bpxnc a feed-forward network (multi-layer perceptron), trained by a modified back-propagation algorithm with a variable learning parameter.

• lmnc a feed-forward network, trained by the Levenberg-Marquardt rule.

• rbnc a radial basis network. This network always has one hidden layer, which is extended with more neurons as long as necessary.

These classifiers have built-in choices for target values, step sizes, momentum terms, etcetera. No weight decay facilities are available. Training stops when there is no improvement on the training set, no improvement on a validation set error (if supplied), or when a given maximum number of epochs is reached.

In addition the following neural network classifiers are available:

• rnnc a feed-forward network (multi-layer perceptron) with a random input layer and a trained output layer. This has a similar architecture as bpxnc and rbnc, but is much faster.

• perlc a single layer perceptron with linear output and adjustable step sizes and target values.

Example 10. The neural network as a classifier

The following lines demonstrate the use of the neural network as a classifier:

>> a = gendats; scatterd(a)

>> w = lmnc(a,3,1); h = plotc(w);

>> for i=1:50,

w = lmnc(a,3,1,w);delete(h);h=plotc(w);disp(a*w*testc); drawnow;

end

Repeat these lines if you expect a further improvement. Repeat the experiment for 5 and 10 hidden units. Try also the use of the back-propagation rule (bpxnc).

Exercise 14. A neural network classification experiment

Compare the performance of networks trained by the Levenberg-Marquardt rule (lmnc) with different numbers of hidden units (3, 5 and 10) for a three-class digit problem (the digits 2, 3 and 5). Use the NIST16 dataset (a = nist16). Reduce the dimensionality of the feature space by pca to a space that contains 90% of the original variance. Use training sets of 5, 10, 20, 50 and 100 objects per class and a large test set. Plot the errors on the training set and the test set as a function of the training size. Which networks are overtrained? What can be changed in these networks to avoid overtraining?
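
The dimension reduction step might look like this (assuming that a fraction between 0 and 1 makes pca preserve that fraction of the variance):

>> a = nist16;
>> w = pca(a,0.9);   % keep 90% of the original variance
>> b = a*w;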

Exercise 15. Overtraining (optional)


Study the errors on the training and test set as a function of the training time (number of epochs) for a network with one hidden layer of 10 neurons. Use as classification problem gendatc with 25 training objects per class. Do this for lmnc as well as for bpxnc.

Exercise 16. Number of hidden units (optional)

Study the influence of the number of hidden units on the test error for the same problem and the same classifiers as in the overtraining exercise (Exercise 15).

Exercise 17. Network outputs and posterior probabilities (optional)

Network output values are normalised, like for all classifiers, by a*w*classc. Compare these outcomes for test sets with the posterior probabilities found for the normal density based classifier qdc and with the 'true' posterior probabilities found for a qdc classifier based on a very large training set. This comparison might be based on scatterplots. Use data based on normal distributions. Train the network with various numbers of steps and try a small and a large number of hidden units.


4 Classifier evaluation and error estimation

Example 11. Evaluation

The following routines are available for the evaluation of classifiers:

testc test a dataset on a trained classifier

crossval train and test classifiers by cross validation

cleval classifier evaluation by computing a learning curve

reject computation of an error-reject curve

roc computation of a receiver-operator curve

gendat split a given dataset at random into a training set and a test set.

A simple example of the generation and use of a test set is the following:

11.1 Load the mfeat_kar dataset, consisting of 64 Karhunen-Loeve coefficients measured for 10*200 written digits ('0' to '9'). A training set of 50 objects per class (i.e. a fraction of 0.25 of 200) can be generated by:

>> a = mfeat_kar

MFEAT KL Features, 2000 by 64 dataset with 10 classes: [200 ... 200]

>> [trainset,testset] = gendat(a,0.25)

MFEAT KL Features, 500 by 64 dataset with 10 classes: [50 ... 50]

MFEAT KL Features, 1500 by 64 dataset with 10 classes: [150 ... 150]

50 × 10 objects are stored in trainset, the remaining 1500 objects are stored in testset. Train the linear normal densities based classifier and test it:

>> w = ldc(trainset);

>> testset*w*testc

Compare the result with training and testing by all data:

>> a*ldc(a)*testc

which is probably better for two reasons. Firstly, it uses more objects for training, so a better classifier is obtained. Secondly, it uses the same objects for testing as well as for training, by which the test result is positively biased. Because of that, the use of separate sets for training and testing is to be preferred.

Example 12. Classifier performance

In this exercise we will investigate the difference in behaviour of the error on the training and the test set. Generate a large test set and study the variations in the classification error based on repeatedly generated training sets:

>> t= gendath([500 500]);

>> a = gendath([20 20]); t*ldc(a)*testc

Repeat this last line e.g. 30 times. What causes the variations in error?
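
A loop makes the repetition easy:

>> e = zeros(1,30);
>> for i=1:30, a = gendath([20 20]); e(i) = t*ldc(a)*testc; end
>> mean(e), std(e)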


Now do the same for different test sets:

>> a= gendath([20 20]);

>> w = ldc(a);

>> t = gendath([500 500]); t*w*testc

Repeat the last line e.g. 30 times and try to understand the size of the variance in the results.

Example 13. Use of cell arrays for classifiers and datasets

In finding the best classifiers over a set of datasets, Matlab cell arrays can be very useful. A cell array is a collector of arbitrary items. For instance, a set of untrained classifiers can be stored as follows:

>> classifiers = {nmc,parzenc([],1),knnc([],3)}

and a set of datasets is similarly stored as:

>> data = {iris,gendath(50),gendatd(30,30,10),gendatb(100)}

Training and test sets can be generated for all datasets simultaneously by

>> [trainset,testset] = gendat(data,0.5)

In a similar way classifier training and error estimation can be done:

>> w = map(trainset,classifiers)

>> testc(testset,w)

Note that the construction w = trainset*classifiers doesn't work for cell arrays. Cross-validation can be done by:

>> crossval(data,classifiers,5)

The parameter '5' indicates 5-fold cross-validation, i.e. a rotation over training sets of 80% and test sets of 20% of the data. If this parameter is omitted, the leave-one-out error is computed. For the nearest neighbour rule this is also done by testk. Take a small dataset a and verify that testk(a) and crossval(a,knnc([],1)) yield the same result. Note how much more efficient the specialised routine testk is.

Example 14. Learning curves introduction

An easy to use routine for studying the behaviour of a classifier on a given dataset is cleval:

>> a = gendatb([30 30])

>> e = cleval(a,ldc,[2 3 5 10 20],3)

This generates at random training sets of sizes [2 3 5 10 20] per class out of the dataset a and trains the classifier ldc. The remaining objects are used for testing (so in this example the set a has to contain more than 20 objects per class). This is repeated 3 times and the resulting errors are averaged and returned in the structure e. This is ready-made for plotting the so-called learning curve by:

>> plotr(e)


which automatically annotates the plot.

Exercise 18. Learning curve experiment

Plot the learning curves of qdc, udc, fisherc and nmc for gendath using training set sizes ranging from 3 to 100. Do the same for a 20-dimensional problem generated by gendatd. Study and try to understand the results.
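
A possible call (assuming cleval accepts a cell array of untrained classifiers):

>> a = gendath([200 200]);
>> e = cleval(a,{qdc,udc,fisherc,nmc},[3 5 10 20 50 100],5);
>> plotr(e)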

Example 15. Confusion matrices

A confusion matrix C has in element C(i,j) the confusion between the classes i and j. Confusion matrices are especially useful in multi-class problems for analysing the similarities between classes. For instance, let us take the IMOX dataset a = imox and split it for training and testing by [train_set,test_set] = gendat(a,0.5). We can now compare the true labels of the test set with the estimated ones found by a classifier:

>> true_lab = getlab(test_set);

>> w = fisherc(train_set);

>> est_lab = test_set*w*labeld;

>> confmat(true_lab,est_lab)

Exercise 19. Confusion matrix experiment

Compute the confusion matrix for fisherc applied to the two digit feature sets mfeat_kar and mfeat_zer. One of these feature sets is rotation invariant. Which one?

Exercise 20. Bootstrap error estimates (optional)

Note that gendat can be used for bootstrapping datasets. Write two error estimation routines based on bootstrap based bias corrections for the apparent error:

e1 = ea - (eba - ebc)

e2 = .368 ea + .632 ebo

in which ea is the apparent error of the classifier to be tested, eba is the bootstrap apparent error, ebc is the apparent error (based on the whole training set) of the bootstrap based classifier and ebo is the out-of-bootstrap error estimate of the bootstrap based classifier. These estimates have to be based on a series of bootstraps, e.g. 25.
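
A sketch of the second estimator (e2), drawing the bootstrap samples manually with replacement:

function e2 = boot632(a,u)
% 0.632 bootstrap error estimate for an untrained classifier u, e.g. u = fisherc
nboot = 25;
m = size(a,1);
ea = a*(a*u)*testc;                 % apparent error
ebo = 0;
for i = 1:nboot
  J = ceil(rand(m,1)*m);            % bootstrap indices, drawn with replacement
  K = setdiff((1:m)',J);            % out-of-bootstrap objects
  ebo = ebo + a(K,:)*(a(J,:)*u)*testc;
end
ebo = ebo/nboot;
e2 = 0.368*ea + 0.632*ebo;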

Exercise 21. Cross-validation (optional)

Compare the error estimates of 2-fold cross-validation, 10-fold cross-validation, the leave-one-out error estimate (all obtained by crossval) and the true error (based on a very large test set) for a simple problem, e.g. gendath with 10 objects per class, classified by fisherc. In order to obtain significant results the entire experiment should be repeated a large number of times, e.g. 50. Verify whether this is sufficient by computing the variances in the obtained error estimates.

Example 16. Reject curves

The classification error for a classification result d = a*w is found by e = testc(d) after determining the largest value in each row of d. By rejection of objects a threshold is used to determine when this largest value is not sufficiently large. The routine e = reject(d) determines the classification error and the reject rate for a set of such threshold values. The errors and reject frequencies are stored in e. We will illustrate this by a simple example.


16.1 Load a dataset by gendath for training Fisher’s classifier:

>> a = gendath([100 100]); w = fisherc(a);

Take a small test set:

>> b = gendath([20 20])

Classify it and compute its classification error:

>> d = b*w; testc(d)

Compute the reject/error trade off:

>> e = reject(d)

Errors are stored in e.error and rejects are stored in e.xvalues. Inspect them by

>> [e.error; e.xvalues]'

The left column shows the error for the reject frequencies shown in the right column. It starts with the classification error found above by testc(d) for no reject (0) and runs to an error of 0 and a reject of 1 at the end. e.xvalues is the reject rate, starting at no reject. Plot the reject curve by:

>> plotr(e)

16.2 Repeat this for a test set b of 500 objects per class. How many objects have to be rejected to achieve an error of less than 0.06?

Exercise 22. Reject experiment

Study the behavior of the reject curves for nmc, qdc and parzenc for the sonar dataset (a = sonar). Take training sets and test sets of equal size ([b,c] = gendat(a,0.5)). Study help reject to see how a set of reject curves can be computed simultaneously. Plot the result by plotr. Try to understand the reject curve for qdc.

Example 17. ROC curves

The roc command computes separately the classification errors for each of the classes for various thresholds. Results for a two-class problem can again be plotted by the plotr command, e.g.

>> [a,b] = gendat(sonar,0.5)

>> w1 = ldc(a);

>> w2 = nmc(a);

>> w3 = parzenc(a);

>> w4 = svc(a);

>> e = roc(b,{w1 w2 w3 w4});
>> plotr(e)

This plot shows how the error shifts from one class to the other for a changing threshold. Try to understand what these plots indicate for the selection of a classifier.


Exercise 23. Construction of the ROC curve

Create your own function myroc constructing an ROC curve on a given dataset with classifier outputs. Hint: use the classifier outputs for each of the test examples as thresholds. The necessary error measures for each threshold may be obtained using confmat.
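
One possible sketch for a two-class problem, thresholding the raw outputs directly rather than via confmat:

function [fpr,tpr] = myroc(d)
% ROC curve from a classified two-class dataset d = testset*w
s = +d(:,1);                    % classifier output for the first class
nlab = getnlab(d);              % numeric true labels (1 or 2)
thr = sort(s);                  % every output value serves as a threshold
fpr = zeros(size(thr)); tpr = fpr;
for i = 1:length(thr)
  pos = (s >= thr(i));          % objects assigned to the first class
  tpr(i) = mean(pos(nlab==1));  % true positive rate
  fpr(i) = mean(pos(nlab==2));  % false positive rate
end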

Exercise 24. Derivation of additional costs (optional)

Adapt the myroc function to return the precision or the positive fraction measure. Compare the behaviour of ROCs and the precision-recall (or positive fraction vs TPr) curves for test sets with different skew ratios.

Running Exercise 3. NIST Digit confusion matrix

Load a dataset of 100 NIST digits for all classes 0 - 9.
Compute the Hu moments using immoments.
Split it in a training and a test set of equal sizes.
Compute and display the confusion matrix of the test result for the nmc classifier.
Repeat this after reversing the roles of training and test sets.
Study the stability.


5 Cluster Analysis and Image Segmentation

Example 18. The k-means Algorithm

We will show the principle of the k-means algorithm graphically on a 2-dimensional dataset. This is done in several steps.

1. Take a 2-dimensional dataset, e.g. a = gendatb;. Set k=4.

2. Initialise the procedure by randomly taking k objects from the dataset:

>> L=randperm(size(a,1)); L=L(1:k);

3. Now use these objects as the prototypes (or centres) of k clusters. Defining labels 1 to k, the nearest mean classifier considers each object as a single cluster:

>> w=nmc(dataset(a(L,:),[1:k]'));

4. Repeat the following line until the plot does not change. Try to understand what happens:

>> lab=a*w*labeld; a=dataset(a,lab); w=nmc(a); scatterd(a);plotc(w)

Repeat the algorithm with another initialisation, on another dataset and with some other values for k. What happens when the nmc classifier in step 3 is replaced by ldc or qdc?

A direct way to perform the above clustering is facilitated by kmeans. Run kmeans on one of the digit databases (for instance mfeat_kar) with k>=10 and compare the resulting labels with the original ones (getlab(a)) using confmat.
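
For instance (renumlab is used here on the assumption that confmat needs comparable numeric label lists):

>> a = mfeat_kar;
>> labc = kmeans(a,10);                % cluster labels, true labels neglected
>> confmat(renumlab(getlab(a)),labc)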

Try to understand what a confusion matrix should show when the k-means clustering had resulted in a random labeling. What does this confusion matrix show about the data distribution?

Example 19. Hierarchical clustering

Hierarchical clustering derives a full dendrogram (a hierarchy) of solutions. Let us investigate the dendrogram construction on the artificial dataset r15. Because hierarchical clustering operates directly on dissimilarities between data examples, we will first compute the full distance matrix (here using the squared Euclidean dissimilarity):

>> load r15

>> d=distm(a);

>> den=hclust(d,'s'); % using the single-linkage algorithm

The dendrogram may be visualised by figure; plotdg(den);. It is also possible to use the interactive dengui command, simultaneously rendering both the dendrogram and the scatterplot of the original data:

>> dengui(a,den)


The user may interactively change the dendrogram threshold and thereby study the related grouping of examples.

Exercise 25. Differences in single- and complete- linkage clusterings

Compare the single- and complete-linkage dendrograms, constructed on the r15 dataset using the squared Euclidean distance measure. Which method is suited better for this problem and why? Compare the absolute values of the thresholds in both situations. Why can we observe an order of magnitude difference?
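
A possible comparison (assuming 's' and 'c' select single and complete linkage in hclust):

>> load r15
>> d = distm(a);
>> dengui(a,hclust(d,'s'))   % single linkage
>> dengui(a,hclust(d,'c'))   % complete linkage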

Exercise 26. Maximum lifetime criterion (optional)

Each clustering solution in a dendrogram survives over a range of thresholds. The dendrogram may be cut by selecting the most stable solution, i.e. the clustering with the maximum lifetime.

For a given dendrogram, find the threshold corresponding to the maximum lifetime. Use the den_getthrs function to retrieve the list of all thresholds. Show the scatterplot of the respective clustering (the labeling specific to a particular threshold may be obtained by the den_getlab function).

Example 20. Clustering by the EM-Algorithm

A more general version of k-means clustering is supplied by emclust, which can be used with several classification algorithms instead of nmc and which returns a classifier that may be used to label future datasets in the same way as the obtained clustering.

The following experiment investigates the clustering stability as a function of the sample size. Take a dataset a and compute for a given choice of the number of clusters k the clustering of the entire dataset (e.g. using ldc as a classifier) by:

>> [lab,v] = emclust(a,ldc([],1e-6,1e-6),k);

Here v is a mapping that by d = a*v 'classifies' the dataset according to the final clustering (lab = d*labeld). Note that for small datasets or large values of k some clusters might become too small for the use of ldc (check with classsizes(d)); in that case nmc may be used instead. The dataset a can now be given the cluster labels lab by:

>> a = dataset(a,lab)

This dataset will be used for studying the clustering stability in the following experiments. The clustering of a subset a1 of n samples per cluster of a:

>> a1 = gendat(a,repmat(n,1,k))

can now be found from

>> [lab1,v1] = emclust(a1,ldc([],1e-6,1e-6));

As the clustering is initialized by the labels of a1, the difference e in labeling between a and the one defined by v1 can be measured by a*v1*testc, or in a single line:

>> [lab1,v1]=emclust(gendat(a,n),ldc([],1e-6,1e-6)); e=a*v1*testc


Average this over 10 experiments and repeat for various values of n. Plot e as a function of n.
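
A sketch of the experiment loop (the values of n are an arbitrary choice):

>> nrange = [5 10 20 50]; e = zeros(size(nrange));
>> for j = 1:length(nrange)
     for i = 1:10
       [lab1,v1] = emclust(gendat(a,repmat(nrange(j),1,k)),ldc([],1e-6,1e-6));
       e(j) = e(j) + a*v1*testc/10;   % average over 10 experiments
     end
   end
>> figure; plot(nrange,e)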

Exercise 27. Semi-supervised learning

We will study the usefulness of unlabelled data in a wrapper approach.

Various self-learning methods are implemented through emc. Investigate how the usefulness of unlabelled data depends on the training sample size and the ratio of labelled vs. unlabelled data. Are there significant performance differences between different choices of cluster model mappings (e.g. nmc or parzenc)? Are there clear performance differences depending on whether the data is indeed clustered or not (e.g. gendats vs. gendatb)?

Running Exercise 4. NIST Digit clustering

Load a dataset A of 25 NIST digits for all classes 0-9.
Compute the 7 Hu moments.
Perform a cluster analysis by kmeans with k = 10, neglecting the original labels.
Compare the cluster labels with the original labels using confmat.


6 Dissimilarity Based Representations

Example 21. Dissimilarity based representations

21.1 Dissimilarity based (relational) representations. Any feature based representation a (e.g. a = gendath(100)) can be converted into a (dis)similarity representation d using the proxm mapping:

>> w = proxm(b,par1,par2); % define some dissimilarity measure

>> d = a*w; % apply to the data

in which the representation set b is a small set of objects. In d all (dis)similarities between the objects in a and b are stored (depending on the parameters par1 and par2, see help proxm). b can be a subset of a. The dataset d can be used similarly to a feature based set. A dissimilarity based classifier using a representation set of 5 objects per class can be trained for a training set as:

>> b = gendat(a,5); % the representation set

>> w = proxm(b); % define a Euclidean distance mapping to the objects in b

>> v = a*w*fisherc; % map all data on the representation set and train

>> u = w*v; % combine the mapping and the classifier

This dissimilarity based classifier for the dataset a can also be computed by one-line:

>> u = a*(proxm(gendat(a,5))*fisherc);

It is like an ordinary classifier in the feature space of a. It can be tested by a*u*testc.

21.2 Embedding of dissimilarity based representations. A symmetric n × n dissimilarity representation d (e.g. d = a*proxm(a,c)) can be embedded into a pseudo-Euclidean space as

>> [v,sig,l] = goldfarbm(d);

v is the mapping, sig = [p q] is the signature of the pseudo-Euclidean space and l are the corresponding eigenvalues (first the p positive ones, then the q negative ones). To check whether d is Euclidean, you can investigate whether all eigenvalues l are nonnegative. They can be plotted by:

>> plot(l,'*')

The embedded configuration is found as:

>> x = d*v;

The 3D approximate (Euclidean) embedding can then be plotted by

>> scatterd(x,3);

To project onto the m most significant dimensions, use

>> [v,sig,l] = goldfarbm(d,m);

Exercise 28. Scatter plot with dissimilarity based classifiers

Generate a training set of 50 objects per class for the banana-set (gendatb). Make a scatterplot of the training set and make the representation set visible as well. Compare the dissimilarity based classifier using Euclidean distances and a representation set of 5 objects per class with the svc for a polynomial of degree 3 (svc([],'p',3)). Repeat this for a dissimilarity based classifier using 10 objects per class.
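
A possible setup (the representation set is marked with plain Matlab plotting; the colour string for plotc is an assumption):

>> a = gendatb([50 50]);
>> b = gendat(a,[5 5]);                          % representation set
>> u = a*(proxm(b)*fisherc);
>> w = svc(a,'p',3);
>> scatterd(a); hold on
>> plot(+b(:,1),+b(:,2),'ko','MarkerSize',12)    % mark the representation set
>> plotc(u); plotc(w,'r')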

Example 22. Different dissimilarities

Sometimes objects are not given by features but directly by dissimilarities. Examples are the distance matrices between 400 images of hand-written digits '3' and '8'. They are based on four different dissimilarity measures: Hausdorff, modified Hausdorff, blurred Euclidean and Hamming. Load a dataset d by load hamming38. It can be split into sets for training and testing by

>> [dtr,dte,i] = gendat(d,10); dtr = dtr(:,i); dte = dte(:,i);

The dataset dtr is now a 20 × 20 dissimilarity matrix and dte is a 380 × 20 matrix based on the same representation set. A simple trick to find the 1-NN error of dte based on the given distances is

>> (1-dte)*testc

A classifier in the representation space can be trained on dtr and tested by dte as:

>> dte*fisherc(dtr)*testc

Exercise 29. Learning curves for dissimilarity representations

Consider four dissimilarity representations for 400 images of handwritten digits '3' and '8': hamming38, blur38, haus38 and modhaus38. Which of the dissimilarity measures are Euclidean and which are not (goldfarbm)? Try to find out the most discriminative measure for learning in dissimilarity spaces. For each distance dataset d, split it randomly into train and test dissimilarity data (see Example 22), select randomly a number of prototypes and train a linear classifier (e.g. fisherc, ldc, loglc). Find the test error. Repeat this e.g. 20 times and average the classification error. Which dissimilarity data allows for reaching the best classifier performance? Do the results depend much on the number of prototypes chosen?
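A minimal sketch of this averaging experiment, reusing the split of Example 22 with 10 prototypes per class (d is assumed to be one of the loaded distance datasets, e.g. load hamming38):

e = zeros(20,1);
for r = 1:20
    [dtr,dte,i] = gendat(d,10);        % random split, 10 prototypes per class
    dtr = dtr(:,i); dte = dte(:,i);    % keep only the representation set columns
    e(r) = dte*fisherc(dtr)*testc;     % linear classifier in dissimilarity space
end
mean(e)                                % averaged classification error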

Exercise 30. Dissimilarity application on spectra

Two datasets with spectral measurements from a plastic sorting application are provided: spectra_big and spectra_small. spectra_big contains 16 classes and spectra_small two classes. The spectra are sampled at 120 wavelengths (features). You may visualize spectral measurements stored in a dataset by using the plots command.

Three different dissimilarity measures are provided, specific to the spectra data:

dasam: Spectral Angle Mapper measures the angle between two spectra interpreted as points in a vector space (robust to scaling)

dkolmogorov: Kolmogorov dissimilarity measures the maximum difference between the cumulative distributions (the spectra should be appropriately scaled to be interpreted as such)

dshape: Shape dissimilarity measures the sum of absolute differences (city block distance) between the smoothed derivatives of the spectra (uses the Savitzky-Golay algorithm)

Compute a dissimilarity matrix d for the measures described. The nearest-neighbour error may be estimated by the leave-one-out procedure using the nne routine. In order to evaluate other types of classifiers, a cross-validation procedure must be carried out. Note that cleval cannot be used for dissimilarity matrices! Use the crossvald routine instead.

Using the cross-validation approach (crossvald), estimate the performance of the nearest neighbour classifier with one randomly selected prototype per class. To do that, use the minimum distance classifier mindistc; nne will not work here. Repeat the same for a larger number of prototypes. Test also the full nearest neighbour rule (with as many prototypes as possible) and a Fisher linear discriminant (fisherc), trained in a dissimilarity space. Find out whether fisherc outperforms the nearest neighbour rule and, if so, how many prototypes suffice to reach this point.

Running Exercise 5. NIST Digit dissimilarities

Load a dataset A of 200 NIST digits for the classes 1 and 8.
Select by gendat at random a dataset B of one sample per class.
Use hausdm to compute the standard and modified Hausdorff distances between A and B.
Study the scatterplots.


7 Feature Spaces and Feature Reduction

Example 23. Mapping

There are several ways to perform feature extraction. Some common approaches are:

• PCA on the complete dataset. This is unsupervised, so it does not use class information. It only tries to describe the variance in the data. In PRTools, this mapping can be trained by using pca on a (labeled or unlabeled) dataset: e.g. w = pca(a,2) finds a mapping to 2 dimensions. scatterd(a*w) plots these data.

• PCA on the classes. This is supervised as it makes use of class labels. The PCA is computed on the average of the class covariance matrices. In PRTools, this mapping can be trained by using klm (Karhunen-Loeve mapping) on a labeled dataset a: w = klm(a,2)

• Fisher mapping. This tries to maximise the between scatter over the within scatter of the different classes. It is, therefore, supervised: w = fisherm(a,2)

23.1 Apply the three methods on mfeat_pix and investigate if, and how, the mapped results differ.

23.2 Perform plot(pca(a,0)) to see a plot of the relative cumulative ordered eigenvalues (normalised sum of variances). In what range lies the intrinsic dimensionality?

23.3 After mapping the data, use some simple classifiers to investigate how the choice of the mappings influences the classification performance in the 2-dimensional feature spaces.

Exercise 31. Eigenfaces and Fisherfaces

The linear mappings used in the example above may also be applied to image datasets in which each pixel is a feature, e.g. the Face-database containing images of 92 × 112 pixels. An image is now a point in a 10304-dimensional feature space.

31.1 Load a subset of 10 classes by a = faces([1:10],[1:10]). The images can be displayed by show(a).

31.2 Plot the explained variance for the PCA as a function of the number of components. When and why does this curve reach the value 1?

31.3 For each of the three mappings, make a 2D scatter plot of all data mapped on the first two vectors. Try to understand what you see.

31.4 The PCA eigenvector mapping w points to positions in the original feature space called eigenfaces. These can be displayed by show(w). Display the first 20 eigenfaces computed by pca as well as by klm, and the Fisherfaces of the dataset.

Exercise 32. Supervised linear feature extraction

In this exercise, you will experiment with pre-programmed versions of canonical correlation analysis, partial least squares and linear discriminant analysis.

32.5 Load the iris dataset. This dataset has labels, but we will convert these to real-valued outputs. In PRTools, this can be done as follows:

>> load iris

>> b = setlabtype(a,'targets');

Dataset b now contains real-valued target vectors.

Make a scatterplot of a and the targets in b; you can extract the targets using gettargets(b). What do you notice about the targets in the scatterplot?

32.6 Calculate a canonical correlation analysis (CCA) between the data and targets in b: [wd,wt] = cca(b,2);. Make a scatterplot of the data projected using wd and the targets using wt. Can you link what you see to what you know about CCA?

32.7 Calculate a 2D linear discriminant analysis (LDA) on a using fisherm. Plot the mapped data and compare to the data mapped by CCA. What do you notice?

32.8 Calculate a partial least squares (PLS) mapping, using pls. Plot the mapped data and the mapped target values, like you did for CCA. Do you see any differences between this mapping and the one by CCA? What do you think causes this?

Exercise 33. Embeddings

Load the swiss-roll data set, swissroll. It contains 1000 samples on a 3D Swiss-roll-like manifold. Visualise it using scatterd(a,3) and rotate the view to inspect the structure. The labels are there just so that you can inspect the manifold structure later; they are not used.

33.9 Apply locally linear embedding (LLE) using the lle function. This function is not a PRTools command: it outputs the mapped objects, not a mapping. Plot the resulting 2D embedded data. What do you notice?

The default value for the number of neighbours to use, k, is 10. What value gives better results?

You can also play with the regularisation parameter (the fourth one). Try some small values, e.g. 0.001 or 0.01.

33.10 (*) Some routines are given to:

• perform a kernel PCA (kernelm) and plot it (plotm);

• train a self-organising map (som) and display it (plotsom);

• perform multi-dimensional scaling (mds);

• perform Isomap (isomap).

Read their help and try to apply them to the swissroll data or your favourite dataset. If the functions take too much time, you can try to first select a subset of the data.

Exercise 34. Feature Evaluation

The routine feateval can be used to evaluate feature sets according to a criterion. For a given dataset, it returns either a distance between the classes in the dataset or a classification accuracy. In both cases, larger values mean better separation.

34.11 Load the dataset biomed. How many features does this dataset have? How many possible subsets of two features can be made from this dataset? Make a script which loops through all possible subsets of two features and creates for each combination a new dataset b. Use feateval to evaluate b using the Euclidean distance, the Mahalanobis distance and the leave-one-out error for the one-nearest neighbour rule.
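A minimal sketch of this exhaustive search, assuming the criterion names 'eucl-m', 'maha-s' and 'NN' are accepted by feateval (check help feateval):

a = biomed;
k = size(a,2);                         % number of features
J = zeros(k,k);
for i = 1:k-1
    for j = i+1:k
        b = a(:,[i j]);                % dataset with only features i and j
        J(i,j) = feateval(b,'NN');     % or 'eucl-m', 'maha-s'
    end
end
[Jmax,ind] = max(J(:));
[ibest,jbest] = ind2sub([k k],ind)     % best pair of two features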

34.12 Find, for each of the three criteria, the two features that are selected by individual ranking (use featseli), by forward selection (use featself) and by the above procedure that finds the best combination of two features. Compute for each set of two features the leave-one-out error for the one-nearest neighbour rule by testk.

Exercise 35. Feature Selection

Load the glass dataset. Rank the features by the sum of the Mahalanobis distances, using individual selection (featseli), forward selection (featself) and backward selection (featselb). The selected features can be retrieved from the mapping w by:

>> w = featseli(a,'maha-s');

>> getdata(w)

Compute for each feature ranking an error curve for the Fisher classifier by clevalf.

>> rand('seed',1); e = clevalf(a*w,fisherc,[],[],5)

The random seed is reset to make the results for different feature sequences w comparable. The command a*w reorders the features in dataset a according to w. In clevalf, the classifier is trained by a bootstrapped version of the given dataset. The remaining objects are used for testing. This is repeated 5 times. All results are stored in a structure e that can be visualised by plotr(e).

Plot the result for the three feature sequences obtained by the three selection methods in a single figure by plotr. Compare this error plot with a plot of the 'maha-s' criterion value as a function of the feature size (use feateval).
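A minimal sketch that overlays the three error curves, following the clevalf call shown above:

a = glass;
figure; hold on;
for w = {featseli(a,'maha-s'), featself(a,'maha-s'), featselb(a,'maha-s')}
    rand('seed',1);                       % same folds for each feature sequence
    e = clevalf(a*w{1},fisherc,[],[],5);  % error curve over the feature sizes
    plotr(e);
end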

Exercise 36. Feature scaling

Besides classifiers that are hampered by the number of features, some classifiers are sensitive to the scaling of the individual features. This can be studied by comparing an experiment in which the data is well scaled with one in which the data is badly scaled.

In relation to sensitivity to badly scaled data, three types of classifiers can be distinguished:

1. classifiers that are scaling independent

2. classifiers that are scaling dependent, but that can compensate badly scaled data by large training sets.

3. classifiers that are scaling dependent and that cannot compensate badly scaled data by large training sets.

First, generate a training set of 400 points for two normally distributed classes with common covariance matrix, as follows:

>> a = gauss(400,[0 0; 2 2],eye(2))


Prepare another dataset b by scaling down the second dimension of dataset a as follows:

>> x = +a; x(:,2) = x(:,2).*0.01; b = setdata(a,x);

Study the scatter plot of a and b (e.g. scatterd(a)) and note the difference when the scatter plot of b is scaled properly (axis equal).

Which of the following classifiers belong to which type (1,2 or 3)?:

• nearest mean (nmc),

• 1-nearest neighbour (knnc([],1)),

• LESS (lessc([],1e6)), and

• the Bayes classifier assuming normal distributions (qdc)?

(Note that for LESS, we set the C parameter high to stress satisfaction of the constraints for correct training object classification.) It may help if you plot the decision boundaries in the scatter plots of a and b and play with the training set size.

Verify your answer by the following experiment:

Generate an independent test set c and compute the learning curves (i.e. an error curve as a function of the size of the training set) for each of the classifiers. Use training sizes of 5, 10, 20, 50, 100 and 200 objects per class. Plot the error curves.
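A minimal sketch of this experiment on the badly scaled data, generating c in the same way as a and b above (gendat(b,m) draws m objects per class, cf. section 13):

c = gauss(400,[0 0; 2 2],eye(2));                  % independent test set
x = +c; x(:,2) = x(:,2).*0.01; c = setdata(c,x);   % badly scaled, matching b
sizes = [5 10 20 50 100 200];
u = {nmc, knnc([],1), lessc([],1e6), qdc};
e = zeros(length(u),length(sizes));
for i = 1:length(sizes)
    tr = gendat(b,sizes(i));          % training subset, sizes(i) per class
    for j = 1:length(u)
        e(j,i) = c*(tr*u{j})*testc;   % train on the subset, test on c
    end
end
figure; semilogx(sizes,e'); legend('nmc','1-NN','LESS','qdc');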

Use scalem for scaling the features on their variance. For a fair result, this should be computed on the training set b and applied to b as well as to the test set c:

>> w = scalem(b,'variance'); b = b*w; c = c*w;

Compute and plot the learning curves for the scaled data as well. Which classifier(s) are independent of scaling? Which classifier(s) can compensate bad scaling by a large training set?

Exercise 37. High dimensional data

In this exercise, you will experiment with datasets for which the number of features is substantially higher than the number of training objects. For this type of dataset, most traditional classifiers are not suitable.

37.13 First, load the colon dataset and estimate the performance of the nearest mean classifier by cross-validation. Set the number of repetitions for the cross-validation function higher (e.g. to 3) to get a more stable performance estimate.

The LESS classifier is a nearest mean classifier with feature scaling. It has an additional parameter to balance data fit and model complexity.

37.14 Estimate the best C parameter setting for the LESS classifier using cross-validation on the entire training set. The number of effectively used features can be inspected as follows:

>> w = lessc(a,C);

>> d = getdata(w);

>> d.nr


37.15 Now, estimate the generalisation performance of the LESS classifier with optimised C parameter. Note that for an unbiased performance estimate, the C parameter should be optimized in each sample of the cross-validation separately. Use the functions nfoldsubsets, nfoldselect, and testc to do the performance estimation through cross-validation. See how cross-validation can be implemented with these functions in nfoldexample.m.

37.16 In this exercise, you will work again with the colon dataset. First reduce the number of features to 50 as follows:

>> labs = getnlab(a);

>> m1 = mean(+a(labs==1,:),1);

>> m2 = mean(+a(labs==2,:),1);

>> [dummy,ind] = sort(-abs(m1-m2));

>> a = a(:,ind(1:50));

What is the effect of this code fragment?

Choose a suitable feature selection method and estimate the generalisation performance of the nearest mean classifier with an optimized number of features. For an unbiased performance estimate, the feature subset should be optimized in each cross-validation sample, as in the previous exercise.

Is there a large difference when the performance is estimated on the features that are optimised on the whole dataset?

37.17 Some routines are given to:

• train a LASSO classifier (lassoc);

• train a LIKNON classifier (liknonc).

Read their help and try to apply them to a high-dimensional dataset.


8 Complexity and Support Vector Classifiers

Exercise 38. Dimensionality Resonance

Generate a 10-dimensional dataset by gendatd. Use cleval with repetition factor 10 to study the learning curves of fisherc and qdc for sample sizes between 2 and 20 objects per class, as plotted by plotr. Note that on the horizontal axis the sample size per class is listed. Explain the maxima.

Study also the learning curve of fisherc for the dataset mfeat_kar. Where is the maximum in this curve and why?

Exercise 39. Regularization

Use again a 10-dimensional dataset generated by gendatd. Define three classifiers: w1 = qdc, w2 = qdc([],1e-3,1e-3) and w3 = qdc([],1e-1,1e-1). Name them differently using setname. Combine them in a cell array and compute and plot the learning curves between 2 and 20 objects. Study the effect of regularization. What is gained and what is lost?
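A minimal sketch, assuming cleval accepts a cell array of untrained classifiers (check help cleval):

a = gendatd([100 100],10);
w1 = setname(qdc,'qdc');
w2 = setname(qdc([],1e-3,1e-3),'qdc reg 1e-3');
w3 = setname(qdc([],1e-1,1e-1),'qdc reg 1e-1');
e = cleval(a,{w1,w2,w3},[2 3 5 10 15 20],10);   % 10 repetitions
plotr(e);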

Example 24. Support Vectors - an illustration

The routine svc can be used for building linear and nonlinear support vector classifiers. Generate a 2-dimensional dataset of 10 objects per class

>> a = gendatd([10 10])

Compute a linear support vector by

>> [w,J] = svc(a)

In J the indices to the support objects are stored. Plot the data, the classifier and the support objects by:

>> scatterd(a)

>> plotc(w)

>> hold on; V = axis; scatterd(a(J,:),'o'); axis(V);

Repeat all this for 50 objects per class generated for the Banana set by gendatb, using a 3rd order polynomial classifier. A 3rd order polynomial support vector classifier can be obtained by setting the kernel to a polynomial kernel with degree 3: [w,J] = svc(a,'p',3).

Replace the polynomial kernel by other kernels (use help svc and help proxm to see what possibilities you have).

Exercise 40. Support Vectors

Add the support vector classifier to exercise 39 and repeat it. Tricky question: how does the complexity of the support vector classifier depend on the trade-off parameter C (which weighs the errors against ‖w‖²)?

Exercise 41. Classification Error

Generate a training set of 50 objects per class and a test set of 100 objects per class, using gendatb. Train several support vector classifiers with an RBF kernel using different width values sigma. Compute for each of the classifiers the error (on the test set) and the number of support vectors. Make a plot of the error and the number of support vectors as a function of sigma. How well can the optimal sigma be predicted by the number of support vectors?
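A minimal sketch of this sweep, assuming the kernel code 'r' selects the radial basis kernel (see help proxm):

a = gendatb([50 50]);                 % training set
b = gendatb([100 100]);               % test set
sigmas = [0.25 0.5 1 2 4 8];
err = zeros(size(sigmas)); nsv = zeros(size(sigmas));
for i = 1:length(sigmas)
    [w,J] = svc(a,'r',sigmas(i));     % RBF support vector classifier
    err(i) = b*w*testc;               % test error
    nsv(i) = length(J);               % number of support vectors
end
figure; plot(sigmas,err,'-o',sigmas,nsv/size(a,1),'-x');
legend('test error','fraction of support vectors');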

Exercise 42. Support Objects

Load a two-class digit recognition problem by a = seldat(nist16,[1 2],[],[1:50]). Inspect it by the show command. Project it on a 2D feature space by PCA and study the scatter plot. Find a support vector classifier using a quadratic polynomial kernel. Visualise the classifier and the support objects in the scatter plot. Look also at the support objects themselves by the show command. What happens with the number of support objects for higher numbers of principal components?

Running Exercise 6. NIST Digit classifier complexity

Load a dataset A of 200 NIST digits for the classes 3 and 5.
Compute the Zernike moments.
Split the data in a training set of 25 objects per class and a test set.
Order the features on their individual performance.
Compute feature curves for the classifiers nmc, ldc and qdc.


9 One-Class Classifiers

Example 25. One-class models

The following classifiers are a subset of the available classifiers that can be used to solve one-class classification problems:

gauss_dd Gaussian data description

mog_dd Mixture-of-Gaussians data description

parzen_dd Parzen data description

nndd Nearest neighbour data description

kmeans_dd k-means data description

pca_dd Principal Component Analysis data description

incsvdd (Incremental) Support vector data description

sdroc ROC estimation using the PRSD toolbox

sddrawroc Interactive ROC plot and selection of an operating point

Use help to get an idea of what these routines do. Notice that all the classifiers have the same structure: the first parameter is the dataset and the second parameter is the error on the target class. The next parameters set the complexity of the classifier (if it can be influenced by the user; for instance, the k in the k-means data description) or influence the optimization of the method (for instance, the maximum number of iterations in the Mixture of Gaussians).

Before these routines can be used on a data set, the class labels in the datasets should be changed to 'target' and (possibly) 'outlier'. This can be done using the routines target_class and oc_set. Outliers can, of course, only be specified if they are available.

Exercise 43. Fraction target reject

Take a two-class dataset (e.g. gendatb, gendath) and convert it to a one-class dataset using target_class. Use the one-class classifiers given above to find a description of the data. Make a scatterplot of the data and plot the classifiers. Firstly, experiment with different values for the fraction of target data to be rejected. What is the influence of this parameter on the shape of the decision boundary?
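A minimal sketch of the first part, assuming target_class accepts the label of the class to keep as target:

a = gendatb([50 50]);          % two-class data
x = target_class(a,1);         % keep class 1 as the target class
w = gauss_dd(x,0.1);           % Gaussian data description, reject 10% of the targets
scatterd(x); plotc(w);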

Secondly, vary the other parameters of incsvdd, kmeans_dd, parzen_dd and mog_dd. These parameters characterise the complexity of the classifiers. How does that influence the decision boundary?

Exercise 44. ROC curve

Generate a new one-class dataset a using oc_set (so that the dataset contains both target and outlier objects), and split it into a train and test set. Train a classifier w on the training set, and plot the decision boundary in the scatterplot.

Make a new figure, and plot the ROC curve there using:

>> h = plotroc(w,a);

There should be a fat dot somewhere on the ROC curve. This is the current operating point. By moving the mouse and clicking on another spot, the operating point of the classifier can be changed. The updated classifier can be retrieved by w2 = getrocw(h).


Change the operating point of the classifier, and plot the resulting classifier again in the scatterplot. Is the new position of the decision boundary what you expected?

Exercise 45. Handwritten digits dataset

Load the NIST16 dataset (a = nist16). Choose one of the digits as the target class and all others as the outlier class using oc_set. Build a training set containing a fraction of the target class and a test set containing both the remainder of the target class and the entire outlier class. Compute the error of the first and second kind (dd_error) for some of the one-class classifiers. Why do some classifiers crash, and why do other classifiers work?

Plot receiver operating characteristic curves (dd_roc) for those classifiers in one plot. Which of the classifiers performs best?

Compute for the classifiers the area under the ROC curve (dd_auc). Does this measure confirm your own preference?

Example 26. Outlier robustness

In this example and the next exercise we will investigate the influence of the presence of an outlier class on the decision boundary. In this example, data is classified using the support vector data description (incsvdd).

Run the routine sin_out(4,3). This routine creates target data from a sinusoid distribution, places an outlier at (x,y) (here (x,y) = (4,3)) and calculates a data description.

Investigate the influence of the outlier on the shape of the decision boundary by changing its position.

Exercise 46. Outlier robustness

Investigate the influence of an outlier class on a decision boundary for other one-class classifiers. Convert a two-class dataset (e.g. gendath) to a one-class dataset by changing all labels to 'target' (e.g. using target_class(+a) or oc_set(+a)). Find a decision boundary for just the target class.

Manually add outliers to your dataset. Compare the decision boundaries.

Exercise 47. Outliers in handwritten digits dataset

Load the Concordia dataset using the routine concor_data. Convert the entire data set to a target class (this time the target class consists of all digits) and split it into a train and test set.

Train a one-class classifier w on the train set. Check the performance of the classifier on the test set z and visualise those digits classified as outliers:

>> zt = target_class(z); % extract the target objects

>> labzt = zt*w*labeld; % classify the target objects

>> [It,Io] = find_target(labzt); % find which are labeled outlier

>> show(zt(Io,:)) % show the outlier objects

Why do you think these particular digits are classified as outliers?

Repeat this, but before training the classifier, apply a PCA mapping, retaining 95% of the variance. What are the differences?

Exercise 48. AUC for imbalanced classes

Load the heart dataset and convert it to a one-class dataset. Now extract a training set using 70% of the data, and put the rest in a test set. Train a standard quadratic classifier (qdc) and the AUC optimizer auclpm. (You can use the default settings: w = auclpm(trainset).)

Compute the ROC curve of both classifiers and plot them. Is there a large difference in performance?

Now reduce the training set size for one of the classes to 10% of the original size, by trainsetnew = gendat(trainset,[0.999 0.1]). Train both classifiers again and plot their ROC curves. What has changed? Can you explain this?

Do the same experiments, but now replace the quadratic classifier by a linear classifier. What are the differences with qdc? Explain!

Exercise 49. Kernelizing mappings or classifiers

Generate the Highleyman dataset and train a (linear) AUClpm. Plot the data and the mapping in a scatterplot (use plotm) and see that the linear classifier does not really fit well.

Now kernelize the mapping by preprocessing the dataset through proxm:

>> w_u = proxm([],'r',2)*auclpm;

>> w = a * w_u;

Also plot the new kernelized mapping.

The kernelized auclpm selects prototypes instead of features. Extract the indices of the prototypes by selecting the indices of the non-zero weights in the mapping by:

>> I = find( abs(w.data2.data.u)>1e-6 );

These ’support objects’ for the auclpm can now be plotted by

>> hold on; scatterd(a(I,:),'o')

Try to kernelize other classifiers: which ones work well, and which ones don't? Explain why.


10 Classifier combining

Example 27. Posterior probabilities compared

If w is a classifier, then the output of a*w*classc can be interpreted as estimates of the posterior probabilities of the objects in a. Different classifiers produce different posterior probabilities. This is illustrated by the following example. Generate a dataset of 50 points per class by gendatb. Train two linear classifiers w1, e.g. by nmc, and w2, e.g. by fisherc. The posterior probabilities can be found by p1 = a*w1*classc and p2 = a*w2*classc. They can be combined in one dataset p = [p1 p2] which has four features (why?). Make a scatter plot of features 1 and 3. Study this plot. The original classifiers correspond to horizontal and vertical lines at 0.5. There may be other straight lines, combining the two classifiers, that perform better.
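A minimal sketch of this comparison:

a = gendatb([50 50]);
w1 = nmc(a); w2 = fisherc(a);
p1 = a*w1*classc; p2 = a*w2*classc;   % posterior probability estimates
p = [p1 p2];                          % four features: two per classifier
scatterd(p(:,[1 3]));                 % posteriors for the first class, both classifiers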

Example 28. Classifier combining strategies

PRTools offers three ways of combining classifiers, called sequential, parallel and stacked.

In sequential combining, classifiers operate directly on the outputs of other classifiers, e.g. w = w1*w2. So the features of w2 are the outputs of w1.

In stacked combining, typically classifiers computed for the same feature space are combined. They are constructed by w = [w1, w2, w3]. If applied by a*w the result is p = [a*w1 a*w2 a*w3].

In parallel combining, typically classifiers computed for different feature spaces are combined. They are constructed by w = [w1; w2; w3]. If applied by a*w then a should be the combined dataset a = [a1 a2 a3], in which a1, a2 and a3 are datasets defined for the feature spaces in which w1, w2, respectively w3 are found. As a result, p = a*w is equivalent to p = [a1*w1 a2*w2 a3*w3].

Parallel and stacked combining are usually followed by a combining rule. The datasets of posterior probabilities p constructed above contain multiple columns (features) for each of the classes. Combining reduces this to a single set of posterior probabilities, one for each class, by combining all columns referring to the same class. PRTools offers the following fixed rules:

maxc maximum selection

minc minimum selection

medianc median selection

meanc mean combiner

prodc product combiner

votec voting combiner

If the so-called base classifiers (w1, w2, . . .) do not produce posterior probabilities but, for instance, distances, then these combining rules operate similarly. Some examples:

28.1 Generate a small dataset, e.g. a = gendatb; and train three classifiers, e.g. w1 = nmc(a)*classc, w2 = fisherc(a)*classc, w3 = qdc(a)*classc. Create a combining classifier v = [w1, w2, w3]*meanc. Generate a testset b and compare the performances of w1, w2, w3 individually with that of v. Inspect the architecture of the combined classifier by parsc(v).


28.2 Load three of the mfeat datasets and generate training and test sets, e.g.:

>> a = mfeat_kar; [b1,c1] = gendat(a,0.25)

>> a = mfeat_zer; [b2,c2] = gendat(a,0.25)

>> a = mfeat_mor; [b3,c3] = gendat(a,0.25)

Note the differences in feature sizes of these sets. Train three nearest mean classifiers

>> w1 = nmc(b1)*classc; w2 = nmc(b2)*classc; w3 = nmc(b3)*classc;

and compute the combined classifier

>> v = [w1; w2; w3]*meanc

Compare the performance of the combining classifier with the three individual classifiers:

>> [c1 c2 c3]*v*testc

>> b1*w1*testc, b2*w2*testc, b3*w3*testc

28.3 Instead of using fixed combining rules like maxc, it is also possible to use a trained combiner. In this case the outputs of the base classifiers are used to train a combining classifier like nmc or fisherc. This demands the following operations:

>> a = gendatb(50)

>> w1 = nmc(a)*classc, w2 = fisherc(a)*classc, w3 = qdc(a)*classc

>> a_out = [a*w1 a*w2 a*w3]

>> v1 = [w1 w2 w3]*fisherc(a_out)

PRTools offers the possibility to define untrained combining classifiers:

>> v = [nmc*classc fisherc*classc qdc*classc]*fisherc

Such a classifier can simply be trained by v2 = a*v

Exercise 50. Stacked combining

Load the mfeat_zer dataset and split it into a training and a test set of equal size. Use the following classifiers: nmc, ldc, qdc, knnc([],3), treec. Determine the performance of each of them. Try to find a combining classifier that performs better than the best one.

Exercise 51. Parallel combining (optional)

Load all mfeat datasets. Split the data into training and test sets of equal size. Make sure that these sets relate to the same objects, e.g. by resetting the random seed each time by rand('seed',1) before calling gendat. Compute for each dataset the nearest mean classifier and estimate their performances. Try to find a combining classifier that performs better than the best one.
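A minimal sketch of a split that keeps the same objects in all sets, following the seed-reset trick above:

rand('seed',1); [b1,c1] = gendat(mfeat_kar,0.5);
rand('seed',1); [b2,c2] = gendat(mfeat_zer,0.5);
rand('seed',1); [b3,c3] = gendat(mfeat_mor,0.5);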

Exercise 52. Bootstrapping and averaging (optional)

The routine baggingc computes a set of classifiers on a single training set by bootstrapping and averaging all coefficients. Compare the performance of a simple classifier like nmc with its bagged version for a 2-dimensional dataset of 20 objects generated by gendatd. Use a test set of 200 objects. Study the performance for bagging sets of sizes between 10 and 200.

Exercise 53. Bootstrapping and aggregating (optional)


The routine baggingc can also be used to combine a set of classifiers based on bootstrapping, using the posterior probability estimates. Combining rules like voting, min, max, mean and product can be used. Compare the performance of a simple classifier like nmc with its bagged version for a dataset generated by gendatd. Study the scatter and classifier plots.

Running Exercise 7. NIST Digit classifier combining

Load a dataset A of 500 NIST digits for the classes 3 and 5.
Compute the Hu moments.
Split the data in a training set of 100 objects per class and a test set.
Generate at random 10 sub-datasets of 25 objects per class from the training set and compute the nmc for each of them.
Combine the 10 classifiers by various combining rules.
Compare the final classifiers with an nmc computed for the total training set by their performances on the test set.


11 Boosting

Example 29. Decision stumps

A decision stump is a simplified decision tree, trained to a small depth, usually just for a single split. The command stumpc constructs a decision tree classifier up to a specified depth. Generate objects according to the banana dataset (gendatb), make a scatterplot and plot in it the decision stump classifiers for the depth levels 1, 2 and 3. Estimate the classification errors using an independent test set and compare the plots and the resulting errors with a full size decision tree (treec).
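A minimal sketch, assuming stumpc takes the desired depth as a parameter as the text suggests (check help stumpc for the exact argument list):

a = gendatb([50 50]);            % training set
b = gendatb([200 200]);          % independent test set
scatterd(a);
for depth = 1:3
    w = stumpc(a,depth);         % decision stump of the given depth (assumed)
    plotc(w);
    disp([depth b*w*testc]);     % depth and test error
end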

Example 30. Weak classifiers

A family of weak classifiers is available by the command W = weakc(A,ALF,ITER,R), in which ALF (0 < ALF < 1) determines the size of a randomly selected subset of the training set A used to train a classifier determined by R:

R = 0: nmc
R = 1: fisherc
R = 2: udc
R = 3: qdc

In total ITER classifiers are trained and the best one according to the total set A is selected and returned in W. Define a set of linear classifiers (R = 0, 1) for increasing ITER, and include the strong version of the classifier:

v1 = weakc([],0.5,1,0); v1 = setname(v1,'weak0-1');

v2 = weakc([],0.5,3,0); v2 = setname(v2,'weak0-3');

v3 = weakc([],0.5,20,0); v3 = setname(v3,'weak0-20');

w = {nmc,v1,v2,v3};

Generate some datasets, e.g. by a = gendath and a = gendatb. Train and plot these classifiers by W = a*w and plotc(W) in the scatterplot (scatterd(a)).

Exercise 54. Weak classifiers learning curves

Compute and plot learning curves for the Highleyman data, averaged over 5 iterations of crossvalidation, for the above defined set of classifiers. Compute and plot learning curves for the circular classes (gendatc), averaged over 5 iterations of crossvalidation, for a set of quadratic weak classifiers.

Example 31. Adaboost

The Adaboost classifier [W,V] = adaboostc(A,BASE-CLASSF,N,COMB-RULE) uses the untrained (weak) classifier BASE-CLASSF for generating N base classifiers on the training set A, iteratively updating the weights for the objects in A. These weights are used as object prior probabilities for generating subsets of A for training. The entire set of base classifiers is returned in V. They are combined by COMB-RULE into a single classifier W. The default is the standard weighted voting combiner.

Study the Adaboost classifier for two datasets: gendatb and gendatc. Use as base classifiers stumpc (decision stump), weakc([],[],1,1) and weakc([],[],1,2).

Plot the final classifier in the scatterplot by plotc(W,'r',3). Plot also the un-weighted voting combiner by plotc(V*votec,'g',3) and the trained Fisher combiner by plotc(A*(V*fisherc),'b',3). It might be necessary to improve the quality of the plotted classifiers by calling gridsize(300) before plotc is executed.


Exercise 55. Adaboost

Compute the Adaboost error curve for the sonar dataset for some numbers of boosting steps, e.g. 5 and 100. (Advanced users may try to write a script that plots an entire error curve.) Use stumpc as a base classifier and weighted voting for combining. Try to improve the result by using other base classifiers and other combiners.
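A minimal sketch of the error estimates for a few numbers of boosting steps:

a = sonar;
[tr,te] = gendat(a,0.5);          % 50/50 train-test split
for n = [5 20 50 100]
    w = adaboostc(tr,stumpc,n);   % weighted voting is the default combiner
    disp([n te*w*testc]);         % number of boosting steps and test error
end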


12 Image Segmentation and Classification

Example 32. Pixel classification

The file userpaint shows an example of how a user can label interesting parts in an image, train a classifier and visualize the resulting classification or detections in a new image.

In this example we will use a tiny database, which is a subset of a much larger image database collected by the University of Surrey. Three versions have been created. The first, surrey_col_64, contains colour 64 × 64 images, the second, surrey_grey_64, contains grey-level 64 × 64 images and the third, surrey_col_128, contains colour 128 × 128 images.

Load one of the sets by load surrey_col_64 and show the images by show a.

Look into the file userpaint and try to understand what steps are performed. Notice that the pixels are just represented by their colour or grey-level values (as defined in the file userpreproc).

Run and play with it (you can change the training and test image, the dataset, the classifier and the region that you paint).

Exercise 56. Supervised and unsupervised pixel classification

Edit the file userpaint and change the given one-class classifier into a two-class classifier. What differences do you observe between a supervised and unsupervised classifier?

Exercise 57. Improve the pixel features (optional)

Invent more interesting features to represent individual pixels. Implement them in (your own copy of) userpreproc. Does this improve the pixel classification?

Exercise 58. Color image segmentation by clustering

A full-colour image may be segmented by clustering the colour feature space. For example, read the famous Lena image in a 256 × 256 version

>> a=lena256;

>> show(a)

The image may be reconstructed as a full colour image by:

>> figure; imagesc(reshape(+a,256,256,3));

The 3 colours may be used to segment the image on its pixel values only. We use a small subset for finding 4 clusters in the 3D colour space:

>> testset=gendat(a,500) % create a small subset

>> [d,w]=emclust(testset,nmc([]),4) % cluster the data

The retrieved classifier w may be used to classify all image pixels in the colour space:

>> lab = classim(a,w);

>> figure

>> imagesc(lab) % view image labels


Finally we will replace each of the clusters by its colour mean:

>> aa=dataset(a,lab(:)) % create labeled dataset

>> map=+meancov(aa) % compute class means

>> colormap(map) % set colour map accordingly

Note that the mean colours are very similar. Try to improve the result by using more clusters.

Exercise 59. Texture segmentation

A dataset a in the MAT file texturet contains a 256 × 256 image with 7 features (bands): 6 were computed by some texture detector; the last one represents the original grey-level values. The data can be visualised by show(a,7). Segment the image by [lab,w] = emclust(a,nmc,5). The resulting label vector lab may be reshaped into a label image and visualised by imagesc(reshape(lab,a.objsize)). Alternatively, we may use the trained mapping w, re-apply it to the original dataset a and obtain the labels by classim: imagesc(classim(a*w)).

Investigate the use of alternative models (classifiers) in emclust, such as the mixture of Gaussians (using qdc) or a non-parametric approach by the nearest neighbour rule knnc([],1). How do the segmentation results differ and why? The segmentation speed may be significantly increased if the clustering is performed only on a small subset of pixels.

Exercise 60. Improving spatial connectivity

The routine spatm extends an image feature dataset with the spatial domain by performing a Parzen classifier in the spatial domain. The two results, the feature space classifier and the spatial Parzen classifier, may then be combined. Let us demonstrate the use of spatm on a segmentation of the multi-band image emim31:

>> a = emim31;

>> trainset = gendat(a,500); % get a small subset

>> [lab,w] = emclust(trainset,nmc,3);

By applying the trained mapping w to the complete dataset a, we obtain a dataset with cluster memberships:

>> b=a*w

16384 by 3 dataset with 1 class: [16384]

Let us now for each pixel decide on a cluster label and visualise the label image:

>> imagesc(classim(b));

This clustering was entirely based on per-pixel features and, therefore, neglects spatial connectivity. By using the spatm mapping, three additional “features” will be added to the dataset b, each corresponding to one of the three clusters:

>> c=spatm(b,2) % spatial mapping using smoothing sigma=2.0

16384 by 6 dataset with 1 class: [16384]

Let us visualise the resulting dataset c by show(c,3). The upper row renders the three cluster membership confidences estimated by the classifier w. The features in the lower row were added by the spatm mapping. Notice that each of them is a spatially smoothed binary image corresponding to one of the clusters. By applying a product combiner prodc, we obtain an output dataset with three cluster memberships based on spectral-spatial relations. This dataset defines a new set of labels:

>> out=c*prodc

16384 by 3 dataset with 1 class: [16384]

>> figure; imagesc(classim(out))

Investigate the use of other classifiers than nmc and the influence of different smoothing on the segmentation result.

Exercise 61. Iterative spatial-spectral classifier (optional)

The previous exercise describes a single correction of spectral clustering by means of the spatial mapping spatm. The process of combining the spatial and spectral domains may be iterated: the labels obtained by combining the two domains may be used to train separate spectral and spatial classifiers again. Let us now implement a simple iterative segmentation and visualise the image labelings derived in each step:

>> trainset = gendat(a,500);

>> [lab,w]=emclust(trainset,nmc,3); % initial set of labels

>> for i=1:10, out=spatm(a*w,2)*prodc; imagesc(classim(out)); pause; ...
      a=setlabels(a,out*labeld); w=nmc(a); end

Plot the number of label differences between iterations. How many iterations are needed to stabilise the algorithm using different spectral models and spatial smoothing parameters?
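A minimal sketch that records the number of label changes per iteration:

trainset = gendat(a,500);
[lab,w] = emclust(trainset,nmc,3);          % initial set of labels
lab_prev = classim(a*w);
ndiff = zeros(10,1);
for i = 1:10
    out = spatm(a*w,2)*prodc;
    lab = classim(out);
    ndiff(i) = sum(lab(:) ~= lab_prev(:));  % labels changed in this iteration
    lab_prev = lab;
    a = setlabels(a,out*labeld);
    w = nmc(a);
end
figure; plot(ndiff);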

13 Summary of the methods for data generation and available data sets

k number of features, e.g. k = 2

m number of samples (ma, mb for classes A and B), e.g. m = 20

c number of classes, e.g. c = 2

u class mean: (1,k) vector (ua, ub for classes A and B), e.g. u = [0,0]

v variance value, e.g. v = 0.5

s class feature deviations: (1,k) vector, e.g. s = [1,4]

G covariance matrix, size (k,k), e.g. G = [1 1; 1 4]

a dataset, size (m,k)

lab label vector, size (m,1)

a = rand(m,k).*(ones(m,1)*s) + ones(m,1)*u   uniform distribution

a = randn(m,k).*(ones(m,1)*s) + ones(m,1)*u   normal distribution with diagonal covariance matrix (s.*s)

lab = genlab(n,lablist)   generate a set of labels, n(i) times lablist(i,:), for all values of i

a = dataset(a,lab)   define a dataset from an array of feature vectors a and a set of labels lab, one for each datavector. Feature labels can be stored in featlab.

a = gauss(m,u,G)   arbitrary normal distribution

a = gencirc(m,s)   noisy data on the perimeter of a circle

a = gendatc([ma,mb],k,ua)   two circular normally distributed classes

a = gendatd([ma,mb],k,d1,d2)   two 'difficult' normally distributed classes (pancakes)

a = gendath(ma,mb)   two classes of Highleyman (fixed normal distributions)

a = gendatm(m)   generation of m objects for each of c normally distributed classes (the means are newly generated at random for each call)

a = gendats([ma,mb],k,d)   two 'simple' normally distributed classes, distance d

a = gendatl([ma,mb],v)   generate two 2D 'sausages'

a = gendatk(a,m,n,v)   random generation by 'adding noise' to a given dataset a using the n-nearest neighbour method; the standard deviation is v times the nearest neighbour distance

a = gendatp(a,m,v,G)   random generation from a Parzen density distribution based on the dataset a and smoothing parameter v; in case G is given, it is used as the covariance matrix of the kernel

[a,b] = gendat(a,m)   generate at random two datasets out of one; the set a will have m objects per class, the remaining ones are stored in b

47

Page 48: Exercises with PRTools - Javeriana Calicic.puj.edu.co/wiki/lib/exe/fetch.php?media=grupos:destino:0... · Exercises with PRTools ASCI ... C. Veenman Information and Communication

In the table below, a list of datasets is given that can be stored in the variable a, provided prdatasets is added to the path, e.g.:

>> a = iris;

>> a

Iris plants, 150 by 4 dataset with 3 classes: [50 50 50]

Routines generating datasets

gauss Generation of multivariate Gaussian distributed data

gendatb Generation of banana shaped classes in 2D

gendatc Generation of circular classes

gendatd Generation of two difficult classes

gendath Generation of Highleyman classes in 2D

gendatl Generation of Lithuanian classes in 2D

gendatm Generation of 8 classes in 2D

gendats Generation of two Gaussian distributed classes

gencirc Generation of circle with radial noise in 2D

lines5d Generation of three lines in 5D

boomerang Generation of two boomerang-shaped classes in 3D

Routines for resampling or modifying given datasets

gendatk Nearest neighbour data generation

gendatp Parzen density data generation

gendat Generation of subsets of a given dataset

Routines for loading public domain datasets

x80 45 by 8 with 3 classes: [15 15 15]

auto_mpg 398 by 6 with 2 classes: [229 169]

malaysia 291 by 8 with 20 classes

biomed 194 by 5 with 2 classes: [127 67]

breast 683 by 9 with 2 classes: [444 239]

cbands 12000 by 30 with 24 classes: [500 each]

chromo 1143 by 8 with 24 classes

circles3d 100 by 3 with 2 classes: [50 50]

diabetes 768 by 8 with 2 classes: [500 268]

ecoli 272 by 7 with 3 classes: [143 77 52]

glass 214 by 9 with 4 classes: [163 51]

heart 297 by 13 with 2 classes: [160 137]

imox 192 by 8 with 4 classes: [48 48 48 48]

iris 150 by 4 with 3 classes: [50 50 50]

ionosphere 351 by 34 with 2 classes: [225 126]

liver 345 by 6 with 2 classes: [145 200]

mfeat_fac 2000 by 216 with 10 classes: [200 each]

mfeat_fou 2000 by 76 with 10 classes: [200 each]

mfeat_kar 2000 by 64 with 10 classes: [200 each]

mfeat_mor 2000 by 6 with 10 classes: [200 each]

mfeat_pix 2000 by 240 with 10 classes: [200 each]

mfeat_zer 2000 by 47 with 10 classes: [200 each]

mfeat 2000 by 649 with 10 classes: [200 each]


nederland 12 by 12 with 12 classes: [1 each]

ringnorm 7400 by 20 with 2 classes: [3664 3736]

sonar 208 by 60 with 2 classes: [97 111]

soybean1 266 by 35 with 15 classes

soybean2 136 by 35 with 4 classes: [16 40 40 40]

spirals 194 by 2 with 2 classes: [97 97]

twonorm 7400 by 20 with 2 classes: [3703 3697]

wine 178 by 13 with 3 classes: [59 71 48]

Routines for loading multi-band image based datasets (objects are pixels, features are image bands, e.g. colours)

emim31 128 x 128 by 8

lena 480 x 512 by 3

lena256 256 x 256 by 3

texturel 128 x 640 by 7 with 5 classes: [128 x 128 each]

texturet 256 x 256 by 7 with 5 classes

Routines for loading pixel based datasets (objects are images, features are pixels)

kimia 216 by 32 x 32 with 18 classes: [ 12 each]

nist16 2000 by 16 x 16 with 10 classes: [200 each]

faces 400 by 92 x 112 with 40 classes: [ 10 each]

Other routines for loading data

prdataset Read dataset stored in mat-file

prdata Read data from file

Some datafiles:

delft_idb 256 9 Delft Image Database

delft_images 619 Delft Images

mnist 2000 10 MNIST train set and testset of handwritten digits

nist 28000 20 Raw NIST handwritten digit database

orl 400 40 Standard face database

roadsigns 332 Scenes with roadsigns

highway 100 2 Pixel labeled highway scenes

flowers 1360 17 Flower images


[Scatter plots of the example datasets, generated by:]

Highleyman Dataset:  a = gendath([50,50]); scatterd(a);
Spherical Set:       a = gendatc([50,50]); scatterd(a);
Difficult Dataset:   a = gendatd([50,50],2); scatterd(a); axis('equal');
Simple Problem:      a = gendats([50,50],2,4); scatterd(a); axis('equal');
Spirals:             a = spirals; scatterd(a);
Banana Set:          a = gendatb([50,50]); scatterd(a);


[Example visualisations of real world datasets, generated by:]

a = faces([1:10:40],[1:5]); show(a);
a = nist16(1:20:2000); show(a);
a = faces(1:40,1:10); w = pca(a,2); scatterd(a*w);
a = faces([1:40],[1:10]); w = pca(a); show(w(:,1:8));
a = iris; scatterd(a,'gridded');
a = texturet; show([a getlab(a)],4);
