
Workbook Pattern Recognition
An Introduction for Engineers and Scientists

C. Rasche
April 25, 2018

This workbook provides rapid, practical access to the topic of pattern recognition. The emphasis lies on applying and exploring the statistical classification methods in Matlab or Python; the mathematical formulation is minimal. Plenty of code examples are given that allow you to immediately play with these methods and that can serve as a reference guide (even the author uses them as such). We start with the very simple and easily implementable k-Nearest-Neighbor classifier, followed by the popular and robust linear classifiers. We learn how to apply Principal Component Analysis and how to properly fold the data. We then introduce clustering methods (K-Means and hierarchical algorithms), decision trees, ensemble classifiers and string matching methods. After having introduced those basic techniques, we expand by introducing modern classifiers such as Support Vector Machines and Deep Neural Networks, and we explain when it is meaningful to employ them. Analogously, we expand on clustering and introduce modern clustering methods such as the density-based methods. Throughout the entire discourse we explain how to deal with very large datasets.

Prerequisites: basic programming skills
Recommended: basic linear algebra, basic signal processing

Speed Links

Task                 Example Code        Check List
Classification       F.1                 16.5
Clustering           F.2                 18.3
Data Preparation     F.3
Folding Explicitly   F.5 (kNN Example)
Distance Measures    A

Contents

1 Introduction
  1.1 The Principal Recognition Tasks
  1.2 Other Recognition Tasks
  1.3 Data Format, Formalism & Terminology
      1.3.1 Types of Feature Values
  1.4 Classification - The Evaluation Principle
      1.4.1 User Interface for Classification (Matlab)
  1.5 Clustering Overview
  1.6 Varia: Code, Software Packages, Training Data Sets

2 Data Preparation (Loading, Inspection, Adjustment, Scaling)
  2.1 Visual Inspection, Group Statistics
  2.2 Special Entries (Not a Number, Infinity, etc.)
  2.3 Scaling
  2.4 Permute Training Set
  2.5 Load Your Data
  2.6 Recapitulation

3 k-Nearest Neighbor (kNN)
  3.1 Usage in Matlab
  3.2 Implementation
  3.3 Recapitulation

4 Linear Classifier (I)
  4.1 Covariance Matrix Σ
  4.2 Usage in Matlab
  4.3 Implementation (Matrix Decomposition)
  4.4 Recapitulation

5 Dimensionality Reduction
  5.1 Feature Transformation - PCA
      5.1.1 Usage in Matlab
      5.1.2 Choice of number of principal components (k)
      5.1.3 Recapitulation
  5.2 Feature Selection
      5.2.1 Usage in Matlab

6 Evaluating and Improving Classifiers
  6.1 Types of Error Estimation
      6.1.1 Validation Set
  6.2 Binary Classifiers
      6.2.1 Confusion Matrix
      6.2.2 Measures and Response Manipulation
      6.2.3 ROC Analysis
  6.3 Three or More Classes
  6.4 More Tricks & Hints
      6.4.1 Class Imbalance Problem
      6.4.2 Learning Curve
      6.4.3 Improvement with Hard Negative Mining and Artificial Samples

7 Clustering - K-Means
  7.1 Usage in Matlab
  7.2 Implementation
  7.3 Determining k - Cluster Validity
  7.4 Recapitulation

8 Clustering - Hierarchical
  8.1 Pairwise Distances, Distance Matrix
  8.2 Linking (Agglomerative)
  8.3 Thresholding the Hierarchy - Cluster Validity
  8.4 Recapitulation

9 Decision Tree
  9.1 Usage in Matlab
  9.2 Recapitulation

10 Ensemble Classifiers
  10.1 Voting
  10.2 Bagging
  10.3 Random Forest
  10.4 Component Classifiers without Discriminant Functions
  10.5 Boosting
  10.6 Learning the Combination
  10.7 Error-Correcting Output Codes
  10.8 Recapitulation

11 Recognition of Sequences
  11.1 String Matching Distance
  11.2 Edit Distance

12 Density Estimation
  12.1 Non-Parametric Methods
      12.1.1 Histogramming
      12.1.2 Kernel Estimator (Parzen Windows)
  12.2 Parametric Methods
      12.2.1 Gaussian Mixture Models (GMM)
  12.3 Recapitulation

13 Support Vector Machines
  13.1 Usage in Matlab
  13.2 Recapitulation

14 Deep Neural Network (DNN)
  14.1 Traditional Neural Networks
      14.1.1 Usage in Matlab
  14.2 Convolutional Neural Network (CNN)
      14.2.1 Usage in Matlab
  14.3 Deep Belief Network (DBN)
      14.3.1 Usage in Matlab
  14.4 Recapitulation

15 Naive Bayes Classifier (Linear Classifier II)
  15.1 Usage in Matlab, Implementation
  15.2 Recapitulation

16 Classification: Rounding the Picture & Check List
  16.1 Bayesian Formulation
      16.1.1 Rephrasing Classifier Methods
  16.2 Estimating Classifier Complexity - Big O Notation
  16.3 Parametric (Generative) vs. Non-Parametric (Discriminative)
  16.4 Algorithm-Independent Issues
  16.5 Check List

17 Clustering III
  17.1 Partitioning Methods II
      17.1.1 Fuzzy C-Means (K-Means)
      17.1.2 K-Medoids
  17.2 Density-Based Clustering (DBSCAN)
      17.2.1 Variants
      17.2.2 Recapitulation
  17.3 Very Large Data Bases (VLDB)
      17.3.1 Hierarchical
      17.3.2 BIRCH
  17.4 High-Dimensional Data
      17.4.1 Dimensionality Reduction
      17.4.2 Subspace Clustering

18 Clustering: Rounding the Picture
  18.1 Summary of Algorithms
  18.2 Clustering Tendency
      18.2.1 Test for Spatial Randomness - Hopkins Test
  18.3 Check List

A Distance and Similarity Measures
  A.1 Distance Measures
  A.2 Similarity Measures

B Gaussian Function

C Varia
  C.1 Programming Hints for Matlab
  C.2 Parallel Computing Toolbox in Matlab

D Matrices (& Vectors): Multiplication and Special Matrices
  D.1 Dot Product (Vector Multiplication)
  D.2 Matrix Multiplication
  D.3 Appendix - Matrices

E Reading

F Code Examples
  F.1 The Classifiers in One Script
  F.2 The Clustering Algorithms in One Script
  F.3 Prepare Your Data
      F.3.1 Whitening Transform
      F.3.2 Loading and Converting Data
      F.3.3 Loading the MNIST dataset
  F.4 Utility Functions
      F.4.1 Calculating Memory Requirements
  F.5 Classification Example - kNN
      F.5.1 kNN Analysis Systematic
  F.6 Estimating the Covariance Matrix
  F.7 Classification Example - Linear Classifier
  F.8 Study Cases for PCA
  F.9 Example Ranking Features
  F.10 Function k-Fold Cross-Validation
  F.11 Example ROC
      F.11.1 ROC Function
  F.12 Clustering Example - K-Means
      F.12.1 Cluster Information and Plotting
  F.13 Hierarchical Clustering
      F.13.1 Three Functions
  F.14 Classification Example - Decision Tree
  F.15 Classification Example - Ensemble Voting
  F.16 Classification Example - Random Forest
  F.17 Example Density Estimation
      F.17.1 Histogramming and Parzen Window
      F.17.2 N-Dimensional Histogramming
      F.17.3 Gaussian Mixture Model
  F.18 Classification Example - Naive Bayes
  F.19 Classification Example - SVM
  F.20 Clustering Example - Fuzzy C-Means
  F.21 Clustering Example - DBSCAN
  F.22 Clustering Example - Clustering Tendency

Preface

There are many wonderful textbooks on the subject of pattern recognition, but most can be assigned to two extreme categories: those starting with the mathematics first and those explaining how to use a software package. The mathematical textbooks often provide the theoretical background first, followed by some examples, while the practical tips appear rather spontaneous, erratic and scarce: that imbalance can deter the impatient scientist or engineer from approaching the subject thoroughly. In contrast, the software-oriented textbooks do not give you sufficient explanation of how you can manipulate the data for your own purpose. The following workbook aims in between: it provides a learning-by-doing approach with which you will be able to understand the basics and which will hopefully allow you to develop your own classifiers that can outperform the standard techniques.

Motivation There is a deeper motivation to provide such a workbook. I have met researchers and tech-entrepreneurs who were not firm with some of the basics of pattern recognition: some did not understand fundamental differences in recognition; others frantically tried to apply the newest classification methods (such as Deep Neural Networks) without verifying whether the old ones deliver satisfying results. But the most recently developed classification methods do not always provide better results; and if they do, then often at the price of a much larger effort, be it time, computational resources or reduced robustness. Figure 1 shows how classification accuracy is related to the complexity of the method: the improvement saturates. Thus, for an easily classifiable problem a more complex method will likely show a better performance; but for a more difficult-to-classify problem it will merely show an improvement - it will not solve the task outright. It is therefore worthwhile to understand

[Figure 1: a schematic plot of Accuracy versus Complexity of Method [Resources], with one curve for an easy task and one for a difficult task.]

Figure 1: Performance gain in pattern recognition: the more complex the method (and the larger the resources), the better the recognition accuracy - in most but not all cases. But that gain in accuracy typically saturates with increasing effort. Put differently, complex methods often solve tasks better than simple methods, but they will not solve a difficult task easily - they will only improve in comparison to the simple methods. You can obtain a good impression of the nature of your data even with simple methods.

the basics well and apply those first - they will usually get you very far in a very short time; and those quick results often allow you to make early decisions in your analysis that are sometimes necessary to continue your research in a specific direction. Only if you intend to optimize a task, or if you need to outperform your competition, do you move on to more modern, advanced classifiers such as Support Vector Machines (SVM) and Deep Neural Networks (DNN). The workbook attempts to teach you those basics as straightforwardly and soundly as possible, but also explains how to use the modern classifiers.


Limited Time or Resources Sometimes your data is so large that you simply lack the time or resources to classify everything thoroughly and you need to concentrate on one part of it. But which part? Which sub-selection of your data would still be representative of your entire set? Again, you can obtain a good idea by applying the basic techniques first: they may deliver sub-par results only, but they allow you to identify the most representative part of your data. This works well because the advanced classifiers (SVM or DNN) do not work magic - they merely improve the results (Figure 1): it is unlikely that the advanced classifiers would identify a different part of your data as the best sub-selection.

Big Data, Deep Neural Networks - On Modern Terms New terms such as Big Data or Deep Neural Networks sometimes suggest breakthroughs, but progress in engineering or science occurs mostly step-wise; conceptual breakthroughs are rare. Younger experts generate new terms to promote their contributions to the field; some older experts tend to belittle such new terminology; this is however the natural cycle of progress and promotion - as it occurs anywhere else, be it in business, in the art scene or in the music industry.

Algorithms that deal with very large databases (VLDB) - or call it Big Data - were invented decades ago, but with modern computers one can tackle larger data more easily. Similarly, neural networks with three or more layers had been envisioned since the 1960s - and occasionally tested - but it is the computational power of modern computers that allows us to use them systematically and on a large scale. And that is certainly exciting per se and justifies the hype.

Related Fields There are some fields related to Pattern Recognition that are hard to separate from it because they overlap so much in content. Two closely related fields are the following; their descriptions should be regarded as tendencies, not as absolutes.

Machine Learning corresponds to the field of Pattern Recognition but focuses a bit more on supervised and reinforcement learning. Some experts consider Pattern Recognition a part of the field of Machine Learning. wiki Machine Learning

Data Mining focuses in particular on unsupervised learning (clustering) of large datasets; often, the traditional techniques of Pattern Recognition are modified so that they are capable of dealing with those large datasets. wiki Data Mining

What to use from the Workbook

Basics: The first 7 sections comprise the basics - the traditional techniques; they cover the principal classification and clustering techniques. With those you obtain results very quickly, and they may already allow you to make decisions to move forward more rapidly.

Basics for ’Quantized’ Data: It may be the case that your data has a certain ’quantized’ character, that is, the typical distance measurements such as ’a minus b’ are not optimal. In that case you may try Decision Trees (Section 9), where comparisons are based on relations such as ’a greater than b’ or ’a smaller than or equal to b’. If you attempt clustering on data that possesses a hierarchical character, you may try the hierarchical clustering methods (Section 8).

Basics - Refinement: By playing with certain tricks you may be able to improve the performance of the basic techniques (Section 10) - in particular if your data comes from different sources. With density methods you can analyze low-dimensional data in more detail (Section 12).

Optimization: Use modern classifiers to achieve higher classification performance (Sections 13 and 14). But be prepared to spend much more time to obtain those better results - and be prepared to ’upgrade’ your computational resources. And be advised that in some cases the results are worse than those of the basic techniques - no matter how well you tune the modern classifiers. For that reason it is recommended to always apply the basic techniques for comparison.


1 Introduction

1.1 The Principal Recognition Tasks wiki Pattern Recognition

The two most common recognition tasks are classification and clustering. Classification is, roughly expressed, discrimination (judgment) based on experience; clustering, in contrast, is exploration, the formation of an opinion with little or no information or knowledge about the problem at hand.

In the task of classification, we try to predict newly observed data based on a model we have created before. That model was created with data that we collected and labeled beforehand - it represents the experience. For example, if we wish to build a model that can read postal codes automatically, then we collect samples of handwritten digits from many different persons, label the digits, and then build a model that discriminates the digits. When we apply the model to handwritten postal codes of other persons, that is, to new samples, the model will predict (the labels of) those samples. The process of training a model is also called learning or fitting. And because in this task the learning process takes place with class labels, it is also called supervised learning.

In the task of clustering, in contrast, we do not have any class information: we explore, and we try to find trends in the data that could correspond to classes. For example, if you were to learn the Chinese symbols without any translation - that is, without any class information - it would make sense to organize those symbols into groups that are visually similar and that therefore might share some semantic meaning. Because during this group-finding process we do not know the meaning of the symbols - their class labels - this type of learning is also called unsupervised learning.

In the following we elaborate on those two types of recognition tasks, followed by mentioning other recognition tasks (Section 1.2). Then we introduce the typical format of data used in software (Section 1.3) and give an overview of how classification and clustering are carried out (Sections 1.4 and 1.5).

Figure 2: Illustrating the classification problem in two dimensions. We are given two sets of points representing two classes, squares and triangles, respectively: they are our training samples (example data). To which group would we assign a new sample (testing point) such as the one marked as a circle?
The two classes may overlap due to measurement noise or because some samples contain aspects of the other class, and that makes classification difficult. Nevertheless, we would like to predict a new sample as well as possible.
In the simplest case we compare the testing sample with all training samples (Section 3). Or we may attempt to model the point clouds with functions, such as a Gaussian function (Section 15). We could also try to find a straight line equation which best separates the two point clouds (Section 4). Of course, each method has its advantages and disadvantages - there is no ’best classifier’.

Classification (Supervised Learning): Look at Figure 2: there are two sets of points visually labeled as squares and triangles, that is, we have two classes (or categories or groups). To which group would you assign the circle? This is difficult to say by eye, but the methods for supervised classification can help us to create a model that makes an optimal decision based on statistics. Those statistical methods are particularly necessary when we tackle data points in higher-dimensional spaces, in three dimensions, in tens of dimensions or even in thousands of dimensions. To train the model we use so-called supervised learning algorithms that exploit the labels. One could say, training takes place with the help of a ’teacher’. In this workbook, we will first treat this type of supervised learning (Sections 3 and 4).

Clustering (Unsupervised Learning) wiki Cluster analysis In this task, we are given data and we try to make sense of them - we try to find meaningful patterns, groups, trends, structures, partitions, etc., which here are called clusters. There are roughly three main purposes for clustering:
- Finding structure: to gain insight into data, generate hypotheses, detect anomalies, and identify salient features.
- Natural classification: to identify the degree of similarity among forms or organisms (phylogenetic relationship).
- Compression: as a method for organizing the data and summarizing it through cluster prototypes.

Clustering is used in the following fields, for example:

- engineering: artificial intelligence, mechanical engineering, electrical engineering
- computer sciences: web mining, spatial database analysis, textual document collection, computer vision
- life and medical sciences: bioinformatics, biology, microbiology, paleontology, psychiatry, clinic, pathology
- earth sciences: geography, geology, remote sensing
- social sciences: sociology, psychology, archeology, education
- economics: marketing, business

Algorithmically the problem is as follows (see Figure 3): there is a set of points, but we lack class information - it is so-called unlabeled data. It appears there exist two ’clusters’; perhaps there are even three? For example, an internet company tries to recognize trends in the access of its website in order to adjust its services: when does which type of customer access the website? Are there perhaps two or three categories of customers that tend to access the website at different hours? Because there is no ’teacher’ in this type of classification problem, it is said that the clustering algorithms perform unsupervised learning.

Figure 3: Illustrating the clustering problem in 2D. We are given a set of points and we attempt to find dense regions in the point cloud, which likely correspond to potential classes. Are there two, three or more classes in the point cloud?
Intuitively one would like to ’smoothen’ the distribution (Section 12) or to measure all point-to-point distances to obtain a detailed description of the point distribution (Section 8), which however is computationally very intensive for large dimensionality; for large dimensionality or datasets we therefore use ’simpler’ procedures such as the K-Means algorithm (Section 7).

We give two specific examples to further motivate the use of clustering methods (from ThKo p598):

Business example for hypothesis testing: cluster analysis is used for the verification of the validity of a specific hypothesis. Consider, for example, the following hypothesis: ’Big companies invest abroad.’ One way to verify whether this is true is to apply cluster analysis to a large and representative set of companies. Suppose that each company is represented by its size, its activities abroad, and its ability to successfully complete projects on applied research. If, after applying cluster analysis, a cluster is formed that corresponds to companies that are large and have investments abroad (regardless of their ability to successfully complete projects on applied research), then the hypothesis is supported by the cluster analysis.

Medical example for prediction based on groups: cluster analysis is applied to a dataset concerning patients infected by the same disease. This results in a number of clusters of patients, according to their reaction to specific drugs. Then, for a new patient, we identify the most appropriate cluster and, based on it, we decide on his or her medication.

We start with a simple but popular clustering algorithm in Section 7, then treat a more complex one in Section 8, and finally address more general issues of clustering in Section 17.

1.2 Other Recognition Tasks

There exist other recognition challenges, which however are often treated only marginally in textbooks, for various reasons.

Dimensionality Reduction This is sometimes considered a separate task of pattern recognition, but most textbooks regard it as an optimization of classification and clustering. We treat the topic in Section 5.

Reinforcement Learning Alp p447 This is the task of making a system adapt over time to new challenges when there are no clear supervisory signals - that is, one lacks class labels that would allow supervised learning - but there is feedback that can be exploited to sense the right learning direction. We do not introduce this topic.

Neural Networks In earlier years, the Neural Network methodology was sometimes considered its own pattern recognition challenge, but one can also assign the methodology to the domain of supervised classification. And with the success of Deep Neural Networks, the methodology has firmly gained its place in supervised classification. We treat it briefly in Section 14.

Regression Analysis Estimates the relationship between predictor variables, such as in trying to fit a line through the point cloud in Fig. 3, in which case we relate y-coordinates and x-coordinates. In high school, we learned how to do this by minimizing the squared error, the deviation in y. Regression analysis is used in particular in forecasting; for instance, we want to know tomorrow's temperature, a continuous value. Thus, in regression we do not predict by seeking labels as in classification; instead, we try to relate variables in order to predict continuous values. We can take that continuous value and assign it a label, such as 15 degrees being classified as cold. For that reason there are many similarities between the methods of classification and regression; sometimes they are essentially the same. Regression is therefore sometimes also considered ’supervised learning’. We do not treat regression for reasons of brevity. If one has understood how to use the classification and clustering algorithms, then one should have no difficulty applying regression algorithms.
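To make the idea concrete, here is a minimal least-squares sketch in Python (numpy); the point cloud is synthetic and the variable names are only illustrative:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)                     # made-up predictor values
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 50)     # noisy linear relation between x and y

slope, intercept = np.polyfit(x, y, deg=1)     # least-squares line: minimizes the squared deviation in y
y_new = slope * 11.0 + intercept               # predict a continuous value for a new x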


1.3 Data Format, Formalism & Terminology

Data often consist of so-called samples, also called observations, which were collected with the same number of measurements. For instance, in computer vision an image is a sample, and its number of pixels corresponds to the number of measurements. And because the number of measurements is often the same, one can conveniently represent the data in a two-dimensional array, namely an m×n matrix D, which is typically organized as follows:

[Figure 4: schematic of an m × n data matrix with a group variable column to its right.]

Figure 4: On the left: an m × n data matrix (or n × d or N × d), abbreviated D or X or other letters. On the right: a group variable - if available - which holds the class or group label for each sample (data point).
Rows: numbered 1 through m; they represent samples or observations: images, objects, a set of measurements - or, expressed statistically, each row represents a data point in a multi-dimensional space. The number of samples m is sometimes also denoted as n or N in other texts.
Columns: numbered 1 through n; they are called features, variables, predictors, components or dimensions - depending on the context or preferred terminology: they can be the individual pixels of an image, the measurements of an object, etc.
Individual matrix entries are denoted as d_ij or x_ij for example, with index variables i and j counting 1 to m and 1 to n, respectively.
The group variable (on the right) is often abbreviated Y and holds three classes in this case. If such group information is present, we can use supervised learning algorithms to make better predictions.

Each row of D describes a data sample (or observation), that is, a vector d, of which each dimension d(j) (or component) - sometimes also denoted as d_j - represents the measurement of a different feature (or variable), j = 1..n. Thus, each sample is a point in n-dimensional space; in Figure 2 it is only a two-dimensional space and the data matrix would have only two columns. In practice, the dimensionality can range from two to several thousand dimensions. In computer vision, for example, the image's individual pixels are often taken as dimensions, that is, for a 200x300 pixel image we have 60'000 dimensions (or features, variables, ...). In bioinformatics the dimensionality can also easily grow to several thousands, in particular in DNA microarray analysis; in web mining as well.

If we know the class labels of our samples, then that information is organized in a group variable Y whose number of elements is the same as the number of samples (see the column in the right part of Figure 4).
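As a minimal sketch of this convention in Python (numpy) - the numbers are made up and the variable names are only illustrative:

import numpy as np

# a made-up m x n data matrix D: m = 6 samples (rows), n = 4 features (columns)
D = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 4.7, 1.6],
              [5.8, 2.7, 4.1, 1.0],
              [7.1, 3.0, 5.9, 2.1],
              [6.5, 3.2, 5.1, 2.0]])

G = np.array([1, 1, 2, 2, 3, 3])   # group variable: one class label per row (three classes)

m, n = D.shape                     # number of samples, number of features
d_23 = D[1, 2]                     # entry d_23 (row 2, column 3; Python indexing is 0-based)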


Other Notations Notations for matrices and group variables may vary. The one above is the typical mathematical notation, but some books or software programs may also speak of an n × d matrix, with n the number of samples and d the dimensionality of the data; or N × d. Or other variable names are used. This variety is of course sometimes confusing, but the traditional mathematical notation using X and Y is not very informative, and that is why it is popular to choose more informative variable names. In rare cases, the axes of the data matrix are flipped - rows are features and columns are samples.

In this workbook we do not pursue a consistent notation, as we took formulations from different textbooks and did not adapt the notation. We prefer the letter G for the grouping variable instead of the letter Y. However, the provided example code is fairly consistent, and its notation will be introduced soon (Section 1.4).
Note: Sometimes the term ’feature vector’ is used in the literature; it stands for a sample d, and not for a column, even though the name might suggest it.

1.3.1 Types of Feature Values ThKo p599, 11.1.2; HKP p40, 2.1

Typically we associate the idea of a measurement value with a number that can be compared to other measurement values by taking their arithmetic difference. Features with such values would be called quantitative and are often continuous. But there also exist feature values with other characteristics - or values may simply be missing. We attempt to summarize the prevalent types, although there exists no absolute agreement on their exact definitions, see wiki Level of measurement.

Quantitative (Real-Valued, Numeric): Familiar quantities are for instance mass, time, distance, heat and angular separation. Thus here we deal mostly with continuous values, but measurements can also be discrete, meaning the number of possible values is limited.

Qualitative (Nominal, Categorical): Examples are nationality, ethnicity, genre, style, etc. In that case, arithmetic differences do not really make sense, except when the quality is described by only two possible values; see Binary next.

Binary: Feature values take only two values, zero or one for instance, as in a computer; or ’false’ and ’true’ - Matlab allows you to specify such values. Binary features can be regarded as a categorical variable with only two possible values. Taking the arithmetic difference between two binary values is possible, but not always optimal: there exist specific difference measures for binary data that can improve your pattern recognition results.

Missing Data It may be the case that there are missing data, meaning a sample may lack one or several component values. There can be various reasons for that: either the measurement was not possible, or a component value is simply not applicable to the specific sample. In that case, it is most appropriate to fill in NaN entries (not-a-number). Software packages can deal with NaN entries in general, but not necessarily all algorithms; for instance, some algorithms will eliminate all features (variables) that contain any NaN entries. This elimination of entire columns is a rather simple work-around and may cause the loss of the precious information in the non-missing values: it can be beneficial to deal with the NaN entries in some other way.
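One simple alternative is mean imputation: replace each missing entry by the mean of its feature column. A minimal Python (numpy) sketch with a made-up matrix; whether this is appropriate depends on your data:

import numpy as np

DAT = np.array([[1.0, 2.0,    0.5],
                [1.2, np.nan, 0.7],
                [0.9, 2.4,    0.6]])      # made-up data with one missing entry

col_means = np.nanmean(DAT, axis=0)       # per-feature mean, ignoring NaN entries
rows, cols = np.where(np.isnan(DAT))      # locations of the missing entries
DAT[rows, cols] = col_means[cols]         # replace each NaN by its column mean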

Heterogeneous Data It is not unusual that the data consist of features of different types, e.g. of continuous (quantitative) and binary variables. In a first step it suffices to scale your data and to test them without separating them. But we can optimize the performance of a classifier if we also test other difference (distance) measures. How to deal with such feature values will be mentioned throughout the workbook.


1.4 Classification - The Evaluation Principle

When we train a model for a classification task, we would like to know how well our classifier will perform on new data. In other words, we would like to know how accurately it predicts when it is given untested (unknown) data, sometimes also called the generalization performance. In the example of the automatic digit classification task, this would give us an estimate of how reliable the system is when it is tested on a large crowd of people. To estimate this properly, we partition our dataset and its labels into a training and a testing set: the training set is used exclusively for training, the testing set for estimating the generalization performance. For example, we split our dataset D (or X) into two equal halves, that is, two separate sets with half the number of samples each. We train on one half and predict on the other half; that gives us one prediction estimate. Then we simply swap training and testing set and so obtain another prediction estimate. We take the mean of the two estimates and that is our estimation of prediction. This is also called hold-out estimation or two-fold cross-validation and is one way to arrive at a prediction estimate. A more elaborate way is to use folding.

Folding The folding scheme divides the data set into n partitions, called folds. One of those folds is reserved for testing, the remaining folds are used for training. Then we rotate through the folds n times and obtain n prediction estimates, which we then average to arrive at our mean prediction estimate. The most common number of folds is five, that is, five-fold cross-validation: one fold is reserved for testing, the remaining four folds are used for training. We will explain in Section 6 why five-fold cross-validation is the most popular choice.
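A minimal sketch of five-fold cross-validation in Python (scikit-learn), using the variable naming of this workbook; the iris data and the kNN classifier serve only as stand-ins (Appendix F.5 shows explicit folding in more detail):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

DAT, Grp = load_iris(return_X_y=True)                  # stand-in data set with labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # five folds
acc = []
for itrn, itst in kf.split(DAT):                       # rotate through the folds
    TREN, GrpTrn = DAT[itrn], Grp[itrn]                # four folds for training
    TEST, GrpTst = DAT[itst], Grp[itst]                # one fold for testing
    Clf = KNeighborsClassifier().fit(TREN, GrpTrn)
    acc.append(Clf.score(TEST, GrpTst))                # prediction estimate of this fold
print(np.mean(acc))                                    # mean prediction estimate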

Our Notation Throughout the book we use DL to denote the training set and DT the testing set (take note of the font style of D). The corresponding group labels are denoted as GL and GT, respectively. In our code examples, we use the variable name DAT for the entire data set. DAT is then split into TREN and TEST - by means of folding - sometimes also named TRN and TST, respectively. Group labels are often named with the abbreviation Grp or Lb.

X (DAT) is split into            In Algorithms   In Matlab/Python Code
training matrix (i.e. 4 folds)   DL, GL          TREN or TRN; GrpTrn or Grp.Trn or LbTrn
testing matrix (i.e. 1 fold)     DT, GT          TEST or TST; GrpTst or Grp.Tst or LbTst

Matlab Matlab provides consistent function names (as of version 2015a) to train and test classification models. Its preferred terminology is fitting and predicting. For instance, for a kNN classifier the two principal function names are

Mdl = fitcknn(TREN, GrpTrn); % learning a model (training)

GrpPred = predict(Mdl, TEST); % predicting new data (testing)

whereby GrpPred holds the estimated class labels. For a Support Vector Machine, the fitting function is called fitcsvm; the predict function remains the same. To apply proper folding - as mentioned above - one can set certain options, to be introduced later. The impatient reader may already take a look at a summary of classifiers in Appendix F.1.

Matlab often uses the mathematical X/Y notation for variable names in function scripts, with X being the data - training or testing - and Y being the grouping variable.

Python - SciKit-Learn In Python's package scikit-learn, the terminology and notation are mostly the same: there, one speaks of learning and predicting, and the corresponding function names are fit and predict. Those functions are called from an estimator instance, which we call Clf in the following code snippet:


from sklearn.neighbors import KNeighborsClassifier # we need to import every function we use

Clf = KNeighborsClassifier() # creating the estimator instance (a kNN classifier in this case)

Clf.fit(TREN, GrpTrn) # learn

Clf.predict(TEST) # test
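Assuming GrpTst holds the true labels of the TEST samples (it is not defined in the snippet above), the prediction accuracy can then be estimated as follows:

from numpy import mean
GrpPred = Clf.predict(TEST)          # predicted class labels for the test set
acc = mean(GrpPred == GrpTst)        # fraction of correctly predicted test samples
# equivalently: acc = Clf.score(TEST, GrpTst)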

1.4.1 User Interface for Classification (Matlab)

Software packages are becoming increasingly convenient for evaluating a classification task. Matlab provides the script classificationLearner to test a range of classifiers. All you need to provide is the data matrix and the group vector. Then you can classify your data using different classifiers chosen from a menu. This is very useful for orientation: it can very quickly point toward the best-classifying model that would be suitable for your data. The downside is that the program is generic and that you cannot fine-tune your system. For that purpose there exists an option that generates the code of your preferred classifier model; that code you can then manipulate toward your specific goals and needs. The workbook teaches how to manipulate that code.

1.5 Clustering Overview

In clustering there is no evaluation phase for the model. Instead, we cluster and then analyze the characteristics of the output: the obtained cluster sizes, their center values, their variances, etc. For most clustering algorithms we need to specify a parameter that expresses the cluster count or cluster size we expect. And because we often do not know exactly what to expect, we often run the clustering algorithm for a range of parameter values and then decide, based on the output analysis, which cluster formation appears most suitable.

Matlab There is no particular naming convention for the functions as there is in classification. We simply apply the respective function.

Python Scikit-learn's functions follow the naming of the classification functions: there is a fit function that runs the algorithm.

Appendix F.2 shows how to apply the clustering algorithms in their simplest form.
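As a minimal sketch of the ’run for a range of parameter values’ strategy in Python (scikit-learn) - the iris data serves only as a stand-in and the inspected quantities are merely examples:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

DAT, _ = load_iris(return_X_y=True)              # stand-in data; the labels are ignored

for k in range(2, 6):                            # try a range of cluster counts
    Km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(DAT)
    sizes = [int((Km.labels_ == c).sum()) for c in range(k)]
    print(k, sizes, round(Km.inertia_, 1))       # cluster sizes and within-cluster scatter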


1.6 Varia: Code, Software Packages, Training Data Sets

Source for this Workbook I have tried to compile the best pieces from each textbook, and I provide exact citations including page numbers; see Appendix E for a listing of titles. My workbook distinguishes itself from the textbooks by specifying the use of some of the procedures more explicitly, and by providing code that is optimized for use in a high-level language such as Matlab.

Mathematical Notation The mathematical notation in this workbook is admittedly a bit messy, because I took equations from different textbooks. I did not make an effort to create a consistent notation, so that the reader can easily compare the equations to the original text. In the majority of textbooks a vector is denoted with a lower-case letter in bold face, e.g. x; a matrix is denoted as an upper-case letter in bold face, e.g. Σ. But there are deviations from this notational norm.

Code The code fragments I provide are written in Matlab and Python; in other languages most of the commands have the same or a similar name. The computations in Matlab are written in vectorized form, which is equivalent to matrix notation: this type of vector/matrix thinking is unusual at the beginning, but highly recommended for three reasons: 1) the computation time is shorter than if one uses for-loops; 2) the code is more compact; 3) the code is less error-prone. However, some code fragments may contain unintended mistakes, as I copied/pasted them from my own Matlab scripts and occasionally made some unverified modifications for instruction purposes. The same holds for my Python scripts. For Python I use the scikit-learn package (Pedregosa et al., 2011).
It can also be useful to check Matlab's file exchange website for demos of various kinds:

http://www.mathworks.com/matlabcentral/fileexchange
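As an illustration of the vectorized style recommended above, here is a small Python (numpy) sketch that computes a pairwise Euclidean distance matrix once with for-loops and once vectorized; the data are random and purely illustrative:

import numpy as np

A = np.random.rand(200, 10)                      # 200 samples, 10 features
B = np.random.rand(150, 10)                      # 150 samples, 10 features

# for-loop version: one Euclidean distance at a time
D_loop = np.zeros((200, 150))
for i in range(200):
    for j in range(150):
        D_loop[i, j] = np.sqrt(np.sum((A[i] - B[j]) ** 2))

# vectorized version: the same 200 x 150 distance matrix in one expression
D_vec = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(D_loop, D_vec))                # True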

Some Software Packages Here are some other languages and packages that should be capable of serving your needs. The Weka software is a pattern recognition package that allows you to use the algorithms through a user interface; however, you may be stripped of some flexibility.

Matlab  Unfortunately expensive and mostly available in academia or industry.
R       Free software package, intended as a replacement for Matlab. wiki R (programming language)
Python  High-level language similar to Matlab, but more explicit in coding; see in particular http://scikit-learn.org/stable/. wiki Python (programming language)
Weka    Free software package written in Java. wiki Weka (machine learning)

Example Datasets Here are links to example datasets that have been used throughout machine learning history and that appear often in textbooks. Some of those datasets are rather small, but it can be convenient to get your classifier working first on such a small dataset and then to move on to real data. We provide some links here, but we did not check whether they are still valid. In Python, many of those datasets come with the module sklearn.datasets, see SKL page 540, section 3.5.

- Iris flower dataset: consists of 150 samples, each one with 4 attributes (dimensions):
  http://mlearn.ics.uci.edu/MLRepository.html
  http://www.ics.uci.edu/~mlearn/MLRepository.html
  It also exists multiple times in Matlab: once as part of the statistics toolbox: load fisheriris; and as part of the fuzzy logic toolbox: load iris.dat. Python has it as datasets.load_iris().

- Handwritten digits (MNIST database): consists of 70000 digits, each one 28 x 28 pixels:
  http://yann.lecun.com/exdb/mnist/. In Python as datasets.load_digits(), though I think that command loads only a subset of those data.

- Bishop also provides other collections, Bis p677:
  http://research.microsoft.com/en-us/um/people/cmbishop/PRML/webdatasets/datasets.htm

Matlab offers quite a number of datasets, see http://www.mathworks.com/help/stats/_bq9uxn4.html.
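A quick way to check, in Python, that such an example dataset arrives in the observations-by-variables format of Figure 4 (shapes shown in the comments are what these loaders return at the time of writing):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)   # (150, 4) data matrix and 150 class labels

digits = datasets.load_digits()             # small digits set: 8 x 8 pixel images
print(digits.data.shape)                    # (1797, 64): each image flattened into one row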


2 Data Preparation (Loading, Inspection, Adjustment, Scaling) HKP p39, Ch 2; ThKo p262, 5.2

Given a new data set, it is best to inspect it and to adjust it as appropriately as possible. The data may contain entries that are difficult for classification algorithms to deal with, such as missing entries in the form of blanks, Not-a-Number (NaN) or other place-holders; there may exist ’unbalanced’ features; there may be dimensions with zero values only; etc. Software programs will perform some of the adjustment automatically, but they tend to eliminate difficult samples or features entirely, thereby most likely lowering the recognition accuracy, as you throw away imperfect samples that may nevertheless carry useful information in their remaining attributes. This section helps you adjust your data so you can classify it optimally. Appendix F.3 has a script that applies the functions we will mention in the following.

Most data files can be opened with the Matlab command importdata. In Python you should consult the modules scipy.io or numpy/routines.io. Section 2.5 gives more details and also advice on how to organize your data if it consists of several files.

Ensure that your data has the typical orientation right from the beginning, namely observations-by-variables (see Figure 4); otherwise it can become confusing in later stages of the analysis. Transpose your matrix as follows:

DAT = DAT’; % if necessary flip the matrix to format observations-x-variables (Matlab)

DAT = DAT.transpose() # (Python)

Now you visually inspect your data; Section 2.1 introduces some standard techniques. Then you check for missing values or other special entries, discussed in Section 2.2. Finally, scaling your data helps improve (prediction) accuracy in many cases, so it can be worthwhile to try different schemes; see Section 2.3.
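Two common scaling schemes, as a Python (numpy) sketch; the data matrix is made up here, and constant columns would need special treatment to avoid division by zero (Section 2.3 discusses scaling in detail):

import numpy as np

DAT = np.random.rand(100, 5) * [1, 10, 100, 1000, 10000]   # made-up data with very different feature ranges

DATz  = (DAT - DAT.mean(axis=0)) / DAT.std(axis=0)          # z-scoring: zero mean, unit standard deviation per column
DAT01 = (DAT - DAT.min(axis=0)) / (DAT.max(axis=0) - DAT.min(axis=0))   # min-max scaling to [0, 1] per column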

2.1 Visual Inspection, Group Statistics

Display Data as Image Because our data is a matrix, a two-dimensional array, it can be conveniently displayed as an image. In Matlab you use the function imagesc for that, in Python imshow. If your data is really large, then simply choose a subset of it - that still allows you to obtain a quick impression of your data. Taking more than several thousand samples and dimensions probably does not make much sense, because your screen resolution is limited, but you may try nevertheless. For very large data this is however not recommended, as your display may lack the necessary memory.

figure(1);clf;imagesc(DAT(1:1500,1:2000));colorbar;

from matplotlib.pyplot import imshow, figure, colorbar

figure(figsize=(4,40)); imshow(DAT); colorbar()

Through this visual analysis, one can often observe certain data characteristics, for instance whether some of your features have a low range of values; whether they have only few non-zero values; perhaps they are even binary. That is, it can give some idea about the types of features that are present (Section 1.3.1). Sometimes one can already recognize the classes, if the samples are organized class-wise.

Descriptive Statistics The next step would be to observe simple descriptive statistics, for example by calculating the mean, standard deviation, range, etc. The command boxplot plots those values immediately:

figure(2);clf;boxplot(DAT);

from matplotlib.pyplot import boxplot

boxplot(DAT)

The dashed black lines (whiskers) outline the range of the data excluding outliers; the blue box delineates the interquartile range; the red line denotes the median value.


Feature Value Distribution To look at the data in more detail we could then employ histograms, e.g. observe the histogram of the first variable:

figure(3);clf;hist(DAT(:,1));

from matplotlib.pyplot import hist

figure(); hist(DAT[:,0])

We return to that in the section on density estimation (Section 12). It is also worth checking whether the variables co-vary, but we elaborate on that later (Section 4.1).

Unique Values With the function unique you obtain a list of only those values that actually occur in your data. Perhaps one of your feature dimensions is a repetition of the same three values, e.g. 3, 4.5 and 5.7. That is useful to know in order to choose appropriate classifier parameters. We can also use this to observe what classes are present in the group variable:

LbU = unique(Grp); % the class/group labels

nGrp = length(LbU); % number of groups

histc(Grp,LbU) % histogram of group members/class instances

from numpy import unique

GrpU = unique(GrpLb)  # the class/group labels

nGrp = len(GrpU)      # number of groups
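
In Python the per-class counts - the analogue of the histc call above - can be obtained directly from unique; a minimal sketch, reusing the GrpLb vector from the lines above:

from numpy import unique

GrpU, Cnt = unique(GrpLb, return_counts=True)  # class labels and their frequencies
print(dict(zip(GrpU, Cnt)))                    # number of samples per class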

2.2 Special Entries (Not a Number, Infinity, etc.)

It is not unusual that data contains missing entries indicated by NaN (not-a-number) or some other place holder. If you generated the feature values yourself, perhaps you 'accidentally' created an infinity entry (Inf). You need to address those special entries somehow, otherwise classification can become difficult and can even return results that make no sense. In Matlab and Python you find NaN values for instance with the function isnan, see the example in Appendix F.3.

Inf/NaN Those entries are generated when there is a division by zero, in which case Matlab will also display a warning; or when one operand is already of either type. Those two cases in more detail:

- Division by zero: Matlab returns a division-by-0 warning and creates an
  Inf entry, if the divisor (denominator) is 0 (e.g. 1/0);
  NaN entry, if both divisor and dividend (numerator) are 0 (i.e. 0/0).

- Any operation with a NaN or Inf entry remains or produces a NaN or Inf entry.

Because most classifiers use multiplication operations, entries with NaN or Inf values will propagate through the computations and therefore likely return useless results. In odd cases you obtain 100% correct classification, for instance if, in a task with two classes, one class contains NaN entries. Here are tricks to deal with those entries:

Avoid Inf by Division: To avoid the creation of infinity entries, one can add a tiny constant to the divisor, e.g. type directly in Matlab

1/(0+eps)

eps is the floating-point accuracy in Matlab, the smallest increment distinguishable from 1 - you would add it to the variable that acts as divisor. This trick yields a very large but finite value - instead of generating an Inf entry - thus permitting further operations with the variable.

Avoid Inf by Scaling: Perhaps it is worth scaling that feature by a function, such as tanh or another function that squashes the values to a small range, see also the next Section 2.3 on scaling.


Inf: If you cannot avoid Inf entries, then try your classification after replacing them with the largest floating point number using the command realmax or intmax.

NaN: If your data already contains not-a-number entries, then you are forced to use special commands that can deal with those, e.g. nanmean, nanstd, nancov, etc. A typical algorithm - implemented in some software - will simply knock out the dimensions (variables) where NaN entries occur. That is a quick solution, but you also lose those variables completely, meaning you probably lower the prediction accuracy.

Constant Variables If your data contains features whose values are all the same, then eliminate those variables immediately. Software implementations will often take care of that, but eliminating them beforehand is always more elegant.

2.3 Scaling ThKo p263, 5.2.2

Your data may have features (dimensions) whose values differ significantly in their range. For one dimension, the differences amongst values could be in the order of thousands; for another dimension, the differences could be tiny. Since a classifier will try to compare the dimensions, it can therefore be beneficial to scale (or standardize) your data. For some functions (mostly in Matlab), there is an option that allows you to specify whether to standardize your data. But it is useful to know that there exist different possibilities to perform the scaling operation:

1. Standardization: Subtract the mean and divide by the standard deviation, for each variable separately. The resulting scaled features will then have zero mean and unit variance. In Matlab we can use the command zscore, in Python this is carried out by sklearn.preprocessing.scale. The code examples in F.3 show how to use it.

2. Scaling to range: Limit the feature values to the range of [0, 1] or [-1, 1] by corresponding scaling. In particular, Deep Neural Networks prefer the data input in unit range [0, 1]:

DAT = bsxfun(@plus, DAT, -min(DAT,[],1)); % set minimum to 0
DAT = bsxfun(@rdivide, DAT, max(DAT,[],1)); % now we scale to 1

3. Scaling by function: Scale the feature values by an exponential or tangent function (e.g. tanh).

4. Whitening transformation (DHS pp 34, pdf 54): this is a decorrelation method in which each sample is transformed using the covariance matrix of the dataset (in effect, multiplied by its inverse square root). The method is called 'whitening' because it transforms the input data to the form of white noise, which by definition is uncorrelated and has uniform variance (see Section F.3.1 for details; a small code sketch follows after this list).
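
As an illustration of items 2 and 4, here is a minimal numpy sketch (not the workbook's reference code): it assumes DAT is the observations-by-variables matrix and uses the eigen-decomposition of the covariance matrix for the whitening; the small constants only guard against division by zero.

import numpy as np

# scaling to unit range [0, 1], per column (item 2)
DATr = (DAT - DAT.min(axis=0)) / (DAT.max(axis=0) - DAT.min(axis=0) + 1e-12)

# whitening (item 4): rotate onto the eigenvectors of the covariance, then rescale to unit variance
Xc = DAT - DAT.mean(axis=0)                         # center the data
lam, E = np.linalg.eigh(np.cov(Xc, rowvar=False))   # eigenvalues and eigenvectors of the covariance
DATw = (Xc @ E) / np.sqrt(lam + 1e-12)              # whitened data: decorrelated, unit variance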

Note 1: To estimate the prediction accuracy of your classifier properly, you should determine the scaling parameters on the training set only and then apply those same parameters to the testing set, e.g.

[TRN Mu Sig] = zscore(TRN); % scaling and obtaining mean and standard deviation

DF = bsxfun(@minus, TST, Mu); % scaling testing set

TST = bsxfun(@rdivide, DF, Sig);
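
The same discipline in Python could look as follows, using sklearn.preprocessing.StandardScaler (a sketch; TRN and TST are assumed to be numpy arrays):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(TRN)   # mean and standard deviation estimated on the training set only
TRN = scaler.transform(TRN)          # scale the training set
TST = scaler.transform(TST)          # scale the testing set with the training parameters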

Note 2: Scaling may distort the relations between dimensions and hence the distances between samples. Therefore, scaling does not necessarily improve classification (or clustering). It may be useful to look at the distribution of individual features (see Section 2.1 above) to see what type of scaling may be appropriate.

2.4 Permute Training Set

For some classifiers, it is important to permute your training set. If your training set is organized group-wise or with any other regularity, then that can lead to wrong predictions. You therefore need to randomize the order of the training samples. Simply create a vector IxPerm with randomly ordered indices and reorder your data and the group variable:


IxPerm = randperm(nSmp); % randomize order of training samples

DAT = DAT(IxPerm,:); % reorder training set

GrpLb = GrpLb(IxPerm); % reorder group variable

Variable nSmp is the number of training samples.
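
A corresponding Python sketch with numpy (assuming DAT and GrpLb are numpy arrays and nSmp the number of samples):

from numpy.random import permutation

IxPerm = permutation(nSmp)   # randomized order of the sample indices
DAT = DAT[IxPerm, :]         # reorder training set
GrpLb = GrpLb[IxPerm]        # reorder group variable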

2.5 Load Your Data

Before you attempt to process and classify your data, it can be useful to format your data in a separate script and to save the formatted data separately - if it is not too large. This is particularly recommended if you have a large number of files. There is of course no general scheme for this preparation, because each dataset is individual. In the following are some tips on how to organize your work and how to load the data:

Organize It is useful to create the following folders to organize your data and scripts:

- DatRaw: place your downloaded data into that folder.
- DatPrep: where the processed data will be saved.
- Classif: Matlab scripts for classification and data manipulation.

Open a script called PrepRawData to prepare your raw data. You will load the data, convert them and then save them. This is elaborated in the following paragraphs:

Loading Data Most files can be loaded with the command importdata. Should that function be insufficient, due to a lack of specificity for example, then one has to start looking at commands such as textscan, textread etc., see also the section entitled 'See Also' at the end of each help document. For images there exists the special command imread. If all else fails, it is not a shame to ask a system administrator to explain to you how to read your data. Some formats can indeed be tricky.

Data Preparation Assign your data to a matrix called DAT of format [Samples x Dimensions] for rows and columns (see again Section 1.3). If you have many files, you may want to initialize the matrix beforehand in order to speed up the preparation step, e.g.

DAT = zeros(nSmp, nDim, ’single’);

Initialize a global structure variable, which contains the path names to those folders, e.g.

FOLD.DatDigits = ’C:/Classification/Data/Digits’;

With the command dir you can obtain a list of all images, e.g.

ImgNames = dir(FOLD.ImgSatImg)

and then access the filenames via ImgNames(1).name. The first two entries will contain '.' and '..', thus start with ImgNames(3). Use fileparts to separate the path into its components.

If you need high computing precision, use double type, instead of single:

DAT = zeros(nSmp, nDim, ’double’);

Matlab does everything in double by default, but double also requires twice as much memory. There may be sufficient hard-disk space, but RAM is often the limiting memory (the more gigabytes the better). For most tasks, however, the datatype single is sufficient.


Grouping Variable Prepare also your group variable Grp, the class/category labels for each sample. If your labels are not numerical - that is, not simply values between 1 and the number of classes - it is recommended that you use the command grp2idx to convert your labels into that format; it will facilitate the classification procedure later. Although Matlab is relatively flexible with labels, for instance allowing string labels, dealing with those can become obscure - unless one enjoys that combination of flexibility and elegance.
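
A Python counterpart to grp2idx is sklearn.preprocessing.LabelEncoder; a minimal sketch, where LblRaw is only a placeholder name for your (possibly string-valued) raw labels:

from sklearn.preprocessing import LabelEncoder

Grp = LabelEncoder().fit_transform(LblRaw)   # e.g. ['cat','dog','cat'] -> [0, 1, 0] (labels 0..nClasses-1)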

Appendix F.3.3 gives an example of how to load and convert. It is a function which returns the data already partitioned into training and testing sets with the corresponding grouping variables called Lbl.

Saving Data - And Reload Later Saving the data is simply done with the Matlab command save. This will save the data in Matlab's own format, which is a compressed format. When you reload the data in your classification script, you use the command load.

Appendix F.3.2 shows code fragments that explain how to program the individual steps.

2.6 Recapitulation

1. If you generate the data yourself, e.g. by a model, then avoid infinity values (or any division by 0) using for instance x/(0+eps). Squash your variables to unit range - it is most practical for classifiers.

2. Load your data with importdata, textscan, textread, etc.

3. Inspect your data visually with plotting commands such as imagesc and boxplot. Inform yourself what entries the grouping variable contains - if there is one.

4. Analyze your data entries for variables containing only one value: eliminate them immediately.

5. If your data contains NaN entries, you should check how your classification algorithm handles them. In most cases, the algorithm will eliminate the corresponding columns.

6. Permute your training set, in particular for classification algorithms that learn sequentially, such as Neural Networks, or for clustering algorithms that analyze data points sequentially.

7. Consider scaling your data. It will improve performance in most cases - for some classifiers scaling is necessary. Try different scaling schemes - there will not be big differences in results, but they could be significant nevertheless.


3 k-Nearest Neighbor (kNN)

The k-nearest neighbor algorithm is the simplest of all classifier algorithms. It makes a judgment based on its immediate neighborhood, on its nearest samples. In the illustration of Figure 2, one would measure the distances from the testing sample - the circle - to all squares and to all triangles and make a decision based on those distances. See now Figure 5 for a close-up of an analogous situation: again, we want to label the circle as either belonging to the triangle class or to the square class. For that purpose we look at its k nearest neighbors, let us say the three nearest neighbors, i.e. we set k = 3. This corresponds to observing the neighboring points within a circle whose center lies at the testing point (the circle). The smaller, gray circle in the figure represents the k = 3 neighborhood. Because it contains two triangles and one square, we would label the testing point as belonging to class 'triangle'. If we looked at k = 7, the larger circle in the figure, then we would label it as belonging to class 'square', because those are in the majority. In short, we look at the most frequent class label among the first k nearest neighbors and that is already the classification decision, the prediction for the testing sample. This classifier is therefore rather unspectacular in an algorithmic sense; no real abstraction of the training samples is sought; one could say that no actual learning takes place.

We summarize the classifier by expressing this a bit more formally. Given a training set, we simply store all its samples as a reference. To classify a novel (testing) sample, we first measure the distance between that sample and all the stored training samples. Then we sort those distances, choose a neighborhood size k, and observe the class labels of the k nearest reference samples: the most frequent class label becomes the label of the testing sample.

Figure 5: The k-Nearest-Neighbor (kNN) classifier algorithm. We have two classes again: squares and triangles. We label a new sample (the circle) based on its closest samples. The smaller gray circle outlines a neighborhood with three nearest neighbors, k = 3: if the sample (green) point were judged by that neighborhood size, it would be classified as a red triangle. The larger circle outlines a neighborhood with 7 neighbors (k = 7); if we judge the sample by that neighborhood size, then it would be classified as a blue square. Algorithmically very simple in its decision making, in practice the classifier struggles with large data sets, because taking a lot of distance measurements is time-consuming.

Evaluation As we have seen in the example of Figure 5, different values of k can yield different class labels. Which k gives us the best classification accuracy is hard to predict for a given data set, and we therefore simply have to try out a range of ks. Choosing an even k is not useful, because we may face a tie. In practice, a k between 3 and 9 will most likely return optimal results.

A different distance metric might also return a different classification accuracy, e.g. choosing the Manhattan distance instead of the 'standard' Euclidean distance (Appendix A.1). This is a 'parameter' that one also has to try out.

The differences in accuracies for different ks or metrics are most likely to be subtle. One therefore has to apply proper folding to ensure that we do indeed achieve significantly better accuracies. The principle of folding was introduced in Section 1.4.

Advantages and Disadvantages What makes the kNN classifier attractive is that not much can go wrong due to its simplicity. Most other classifiers will generate a better classification accuracy, but the accuracy of the kNN classifier can serve as a lower performance reference. If we do not achieve a better classification accuracy with a more complex classifier, then we might not have applied it properly.

Another advantage of this classifier is that it does not require many training samples. Even with a few training samples, we can make a prediction with the kNN classifier without any concerns, whereas for other classifier models a prediction based on few samples is always a very vague result.

The disadvantage of the kNN classifier is that it is slow when the data are large. The reason is that taking a lot of distance measurements is time-consuming. If the dimensionality of your data is small, i.e. 10 dimensions or less, then there are good solutions to the problem and software packages will choose such solutions automatically. We introduce those solutions in a subsequent paragraph. For larger dimensionality, however, the kNN classifier can become unfeasibly slow, in which case you need to choose a different classifier.

Algorithmic Procedure We now formalize the learning and classification procedure more explicitly. Given is a training set, a matrix TRN with corresponding group (class) labels in vector GrpTrn; and there is a testing set, a matrix TST with corresponding group variable GrpTst (see also Algorithm 1). To classify a sample from the testing set - one row vector of TST - we measure the distance to all samples in TRN and place the distance values into an array Dist. Then we sort the distances in Dist in increasing order - nearest neighbor first - and change the order of the group labels GrpTrn accordingly. Now we observe the k nearest neighbors in the re-arranged group variable - the actual distances are not of real interest anymore.

Algorithm 1 kNN classification. DL = TRN (training samples), DT = TST (testing samples). G: vector with group labels (length = nTrainingSamples).

Initialization: scale data
Training: training samples DL with class (group) labels G. (In fact, no actual training takes place here.)
Testing: for a testing sample (∈ DT): compute distances to all training samples → D; rank (order) D → Dr
Decision: observe the first k (ranked) distances in Dr (the k nearest neighbors): e.g. a majority vote over the most frequent class label among the kNN determines the category label

Tree-Type Data Structures As mentioned already, for a large number of samples the distance measurements become time consuming. But one can save some of the measurements by exploiting 'relations'. For instance, if we have determined that the distance between some point A and some point B is large, and we have determined that the distance between A and some point C is small, then we know that the distance between B and C is also large. There are structures that build a tree-type storage, with which one can determine distances much quicker. Those structures are akin to the decision trees we will introduce in Section 9 - similar to the flow diagrams we had in school. Those structures are however only useful if the dimensionality is low, i.e. around 9 to 16 dimensions, a number which depends a bit on the expert's viewpoint. And they are only useful if the data is not sparse. Sparse means that the data contain a lot of zeros, that is, sparse refers to there being only few non-zero entries. If those two conditions are fulfilled, then software packages will automatically create a tree-type data structure that makes distance measurements faster. If those two conditions are not fulfilled, then the distance measurements are computed in full and that can take very long for large datasets. In Matlab this is called 'exhaustive search'.
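
As an illustration, scikit-learn exposes such a tree structure directly; a sketch with sklearn.neighbors.KDTree, assuming low-dimensional, dense TRN and TST arrays (these names are not part of the workbook's reference code):

from sklearn.neighbors import KDTree

tree = KDTree(TRN)                  # build the tree-type storage on the training samples
dist, ixnn = tree.query(TST, k=5)   # distances and indices of the 5 nearest training samples

The classifier classes of that package (e.g. KNeighborsClassifier) select such a structure automatically through their algorithm parameter, much as Matlab does.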

Variations The classification decision can also be carried out by giving weights to the neighbors: with that we can give preference to some neighbors over others. This is implemented as an option in the respective software commands. kNN classification can also be carried out with a fixed neighborhood: one specifies a distance, which then corresponds to a radius. In Matlab this is available with rangesearch, in Python as RadiusNeighborsClassifier.


3.1 Usage in Matlab

The code segments introduced next also exist as a copy-paste example in Appendix F.5. Here we highlight certain lines. The simplest way to apply Matlab's function fitcknn is to feed it the entire data set DAT with the entire group label vector GrpLb:

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,5);

pc = 1-kfoldLoss(Mdl); % percent correct = 1-error

In this case, the folding is done automatically, namely a 10-fold cross-validation: 9 folds are used for training, 1 fold is used for testing; then we rotate 9 times, obtaining 10 classification estimates in total. With the function kfoldLoss the average error over the 10 rotations is calculated automatically. We can specify a different number of folds explicitly:

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN, ’kfold’,nFld);

If we wish to control the folding ourselves, then we can do this using the function crossval:

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

If we desire to fold the data completely ourselves, then here is how we would use the functions for one fold:

Mdl = fitcknn(TRN, Grp.Trn, ’NumNeighbors’,nNN);

LbPred = predict(Mdl, TST);

Bhit = LbPred==Grp.Tst; % binary vector with hits equal 1

The function predict takes the model and estimates labels for the testing data TST. It outputs a one-dimensional variable that contains the predicted group labels, called LbPred here. Now you only need to compare them with your true (actual) group labels, LbPred==Grp.Tst, and evaluate that binary vector: 1s correspond to hits (correct classifications) and 0s to incorrect classifications.
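
The corresponding Python workflow with scikit-learn might look as follows (a sketch; GrpTrn and GrpTst denote the group label vectors of the training and testing set):

from sklearn.neighbors import KNeighborsClassifier

mdl = KNeighborsClassifier(n_neighbors=5).fit(TRN, GrpTrn)  # 'training' = storing the reference samples
LbPred = mdl.predict(TST)                                   # predicted labels for the testing set
pc = (LbPred == GrpTst).mean()                              # fraction of correct classifications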

Own Implementation - knnsearch If you wish to write your own implementation, then the function knnsearch comes in handy. You pass the training and testing set as variables and the number of neighbors k as a parameter,

[IXNN Dist] = knnsearch(TRN, TST, ’k’, 5);

and you receive the ntst × k matrix IXNN, which contains the indices of the training samples (ntst = number of testing samples). Those you need to convert to the corresponding group labels and then determine the most frequent group (class) within your given neighborhood k.

Older Matlab versions had a function knnclassify, which one applied as follows:

LbPred = knnclassify(TST, TRN, Grp.Trn, 5);

This is not included anymore in the code example in the Appendix.

Large Data In Matlab, data with dimensionality up to 9 will be converted into a tree-type structure automatically. One can also enforce the use of tree-type structures, but then it is left to the user to interpret the results properly.


3.2 Implementation

Coding a kNN classifier is fairly easy. Here are some fragments to show how little it actually requires (see also ThKo p82). A complete example is given in Appendix F.5.

%% --- Knn classification

nCls = 2; % # of classes

nTrn = size(TRN,1); % # of training samples

nTst = size(TST,1); % # of testing samples

GNN = zeros(nTst,11); % we will store 11 nearest neighbors

for i = 1:nTst

iTst = repmat(TST(i,:), nTrn, 1); % replicate to same size [nTrn nDim]

Diff = TRN-iTst; % difference [nTrn nDim]

Dist = sum(abs(Diff),2); % Manhattan distance [nTrn 1]

[dst ix] = min(Dist); % min distance for 1-NN

[~, O] = sort(Dist,’ascend’); % increasing dist for k-NN

GNN(i,:) = Grp.Trn(O(1:11)); % closest 11 samples

end

%% --- Knn analysis quick (for 5 NN)

HNN = histc(GNN(:,1:5), 1:nCls, 2); % histogram for 5 NN

[Fq LbTst] = max(HNN, [], 2); % LbTst contains class assignment

Hit = LbTst==Grp.Tst;

fprintf(’Perc correct for 5NN %1.2f\n’, nnz(Hit)/nTst*100);

See also the programming hints in Section C.1 for why we chose a for-loop in this case.
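
For completeness, a compact numpy version of the same idea (a sketch using the Manhattan distance; it assumes TRN, TST are arrays and GrpTrn, GrpTst are non-negative integer label vectors, and it broadcasts the full distance matrix, so for very large data you would loop row-wise as above):

import numpy as np

Dist = np.abs(TST[:, None, :] - TRN[None, :, :]).sum(axis=2)   # [nTst x nTrn] Manhattan distances
NN = np.argsort(Dist, axis=1)[:, :5]                           # indices of the 5 nearest neighbors
GNN = GrpTrn[NN]                                               # their class labels
LbTst = np.array([np.bincount(row).argmax() for row in GNN])   # majority vote per testing sample
print('Percent correct for 5NN %.2f' % (100 * (LbTst == GrpTst).mean()))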

In Appendix F.5.1 we give an example of how to analyze a range of different ks.

3.3 Recapitulation

Recommendation Even though the kNN may not provide a good prediction accuracy (percent correctly classified), it can serve as a lower reference when using other classifiers: if we do not obtain a better prediction performance with more complex classifiers, then we should consider the possibility that we have not applied the complex classifiers properly.

Advantages
- With the kNN classifier we obtain a lower 'reference' with an easily implementable decision model.
- The kNN classifier even works when only few training samples are available, for instance n < 5 per class, a situation for which other classifiers can be vague in prediction.

Disadvantages The larger the data set, the slower the classification. In professional terminology, one says that the kNN classifier has complexity O(dn), with d the dimensionality (number of attributes) and n the number of samples. This is the so-called O-notation, which we will explain in a bit more detail later (Section 16.2). To alleviate the complexity problem, one can use tree-type data structures, but that only works for limited dimensionality.

Other The kNN classifier does not have an actual learning process; that is, no effort is made to abstract or manipulate the data to derive a simple decision model. In fact, it implements only a decision rule and nothing more.


4 Linear Classifier (I)

A linear classifier tries to find a line that separates the data points of different classes. That line is also called a decision boundary. In the case of a two-dimensional example, see Figure 6, one can think of the boundary as a straight line. If we want to estimate the class label of a new (testing) sample - the green circle - then one would determine on which side of the boundary it lies.

Figure 6: What would be the most suitable straight line that separates the two point clouds? A linear classifier is an algorithm that tries to find the ideal line parameters. For a new sample point, one would determine on which side of the line it lies.

In two dimensions, the line equation is described by a slope m and an offset c:

y = mx + c.   (1)

The learning procedure of a linear classifier attempts to find a suitable slope m and bias c, such that the x/y-values of one class are well separated from the x/y-values of the other class. Finding the optimal parameters is similar to the regression problem, which attempts to fit a straight-line equation to a set of points. In order to classify a (testing) sample point, we enter its coordinate values into the line equation and determine on which side of the line it falls.

For three dimensions, there would be two values of m; in that case we attempt to find a plane; for four or more dimensions the number of m values grows correspondingly and we talk of hyperplanes. There remains only one bias parameter c. In the terminology of pattern recognition, we speak of weights instead of slope and offset, and both are lumped into a weight vector w. To explain the classification procedure in more detail, we continue with the example of a binary classification task - a classification of two classes only - and then elaborate on the multi-class procedure - a classification with three or more classes:

Binary classification (2 classes): Assume the input vector x(i) has d dimensions, i = 1, .., d. Then there will be a corresponding weight vector w(i) of the same dimensionality - the same number of components - which can be regarded as the m-values of the linear equation. Now we add one component to each vector, which corresponds to the bias c of the linear equation: to the input vector we add the component x0 = 1, namely a constant value equal to one; to the weight vector we add another variable w0. Those two vectors are then element-wise multiplied and the products are summed:

g(x) = \sum_{i=0}^{d} x(i) w(i)   (2)

g is also called the discriminant function and it produces a scalar - a single value. Figure 7 illustrates this procedure: there one talks of 'units', which correspond to the term components (or features).

The above sum of products is in mathematics also called the dot product or scalar product or inner product. To clarify, each component x(i) - sometimes also denoted as x_i - is multiplied by the corresponding weight value w(i) (or w_i) and the products are summed. This product can also be expressed by two other, more compact notations:

\sum_i x(i) w(i) \equiv x \cdot w \equiv x^t w.   (3)

If this notation is completely unfamiliar to you, then study Appendix D.

Figure 7: A simple linear, binary classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value x_i is multiplied by its corresponding weight w_i; the effective input at the output unit is the sum of all these products, \sum_i w_i x_i. We show in each unit its effective input-output function. Thus each of the d input units is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant value 1.0. The single output unit emits +1 if w^t x + w_0 > 0 and -1 otherwise. [Source: Duda, Hart, Stork 2001, Fig 5.1]

The dot product already represents the classification principle of the model. To decide the class label of a testing sample, we observe the sign of the scalar value g(x): a positive value means the sample belongs to one class, a negative value means it belongs to the other class.
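
In code this decision rule is a single line; a numpy sketch, where w and w0 stand for an already trained weight vector and bias (hypothetical names, not the workbook's reference code):

import numpy as np

G = TST @ w + w0                  # discriminant value g(x) for every testing sample (eq. 2)
LbPred = np.where(G > 0, 1, -1)   # the sign of g decides between the two classes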

Multiple-Class Classification: For a classification task with multiple classes, there is a weight vector w_k (of length d) for each individual class k. Those k weight vectors are concatenated and expressed as a k × d weight matrix W, whose size explicitly expressed is [number of classes x number of dimensions]. The classification procedure then consists of two steps: the first step is the computation of the discriminant value for each class; the second step is the selection of the maximum discriminant value to pick the most 'fitting' class label.

The computation of the k discriminant values is expressed in a single line, whereby we add the index k as a subscript to g to express that we obtain a one-dimensional array of values:

g_k(x) = x^t W.   (4)

The notation appears not to have changed much in comparison to the dot product (eq. 3), but the use of a matrix W makes it a matrix product, see again Appendix D for details. As explained already, the result here is k values. In the context of pattern recognition, those values are also called posterior values; they express the confidence for a class.

To select the class label with the highest confidence, we simply apply the maximum operation to the array g_k:

argmax_k g_k.   (5)


'arg' stands for argument, meaning the index in the array g_k where the maximum occurs. In Matlab this is included in the command max by specifying a second output argument:

[vl ix] = max(Post);

That is, the variable ix holds the selected class index; the variable vl holds the maximum posterior value.
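
The same two steps in Python/numpy (a sketch; W stands for an already trained k x d weight matrix, hence the transpose):

Post = TST @ W.T            # discriminant (posterior) values, one column per class (eq. 4)
ix = Post.argmax(axis=1)    # selected class index per sample (eq. 5)
vl = Post.max(axis=1)       # the corresponding maximum posterior value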

Variants: The above equations represent only a principle and there exist many variants, but all linear classifier models contain at their heart the matrix product between a testing sample and a weight matrix of corresponding dimensionality. Most modern linear classifier models also analyze the dependence between the feature variables, using either the covariance matrix directly or a similar analysis. The covariance matrix will be introduced in the following Section 4.1.

Learning: The challenge is of course to find the appropriate weight values W, which best separate the classes. Mathematically speaking - and formulated for a two-class (binary) problem - we deal with a linear programming problem, because finding the discriminant functions g_i(x) amounts to solving a set of linear inequalities: w^t x_i > 0. There exists a large number of methods to solve such inequalities, many of which belong to two important categories: gradient descent procedures and matrix decomposition methods. We do not elaborate on these methods, but merely explain how one type of matrix decomposition is implemented (Section 4.3). In Section 15, we explain how w is estimated in a crude but straightforward manner.

Building and applying a linear classifier can be summarized as follows:

Algorithm 2 Linear classifier principle. k = 1, .., nclasses. W_{k×d} = {w_k}, G: vector with class labels.
Training: find the optimal weight matrix W for g_k(x) = x^t W exploiting G (x ∈ DL), using matrix decomposition for instance.
Testing: for a testing sample x determine g_k(x) = x^t W (x ∈ DT)
Decision: choose the maximum of g_k: argmax_k g_k

4.1 Covariance Matrix Σ wiki Covariance, Covariance matrix

The covariance matrix expresses to what degree the individual variables (or features or dimensions) depend on each other. It is used in many machine learning algorithms.

We recall that for a single variable, the variance measures how much its values are spread around their mean. Analogously, when we have two variables A and B, we observe how the corresponding elements co-vary with respect to the two corresponding means, \bar{A} and \bar{B} respectively:

q_{A,B} = \frac{1}{N-1} \sum_{i=1}^{N} (A_i - \bar{A})(B_i - \bar{B}),   (6)

where N is the number of observations. If the individual differences co-vary, then the covariance value q_{A,B} is positive, otherwise it is negative. The divisor N − 1 is typical for an estimate of the covariance.

For reasons of practicality, one generates a full matrix Σ, which for two variables looks as follows:

Σ = \begin{bmatrix} q_{A,A} & q_{A,B} \\ q_{B,A} & q_{B,B} \end{bmatrix},   (7)

where the entries along the first diagonal (upper left to lower right) correspond to the 'self-variance' of the variables, namely q_{A,A} and q_{B,B}; and the other two entries have the same value, that is q_{B,A} = q_{A,B}. For three or more dimensions, one would create a d × d matrix (d = number of dimensions), with the values along the diagonal again being the self-variances, and the corresponding values above and below the diagonal again being the same. The covariance matrix is thus a square, symmetric matrix; its calculation can also be expressed in matrix notation - the exercise will reveal how.

Observe that Σ is denoted in boldface, as opposed to the summation sign Σ. Strictly, Σ is the symbol for the true (theoretical) covariance matrix, which we typically do not know. That is why the estimated covariance matrix is denoted with a 'hat', namely as Σ̂ in books, spoken 'sigma hat'. In Matlab, the matrix Σ̂ can be generated with the command cov; if the data contain not-a-number entries then you can use nancov.
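
In Python the estimate is obtained with numpy.cov; note that numpy by default treats rows as variables, so for our observations-by-variables matrix we pass rowvar=False (a sketch):

import numpy as np

SigmaHat = np.cov(DAT, rowvar=False)   # d x d estimated covariance matrix (divisor N-1)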

Small Sample Size Problem In those classifiers that make use of the covariance matrix, the matrix is often computed for each class separately, which will be made explicit in Section 15. Calculating a covariance matrix is easy, yet for many linear classifiers this matrix also needs to show certain properties, e.g. it needs to be positive (semi-)definite, which roughly means that all its eigenvalues are non-negative; we omit further details of those properties. Our concern is that this constraint is sometimes not fulfilled, in particular if the dataset has only few samples. And if the covariance matrix is generated for each individual class, then the problem of obtaining an 'adequate' covariance matrix is aggravated. This is known as the small sample size problem. If an appropriate covariance matrix cannot be generated, then Matlab may complain with the following error:

The pooled covariance matrix of TRAINING must be positive definite.

To work around this barrier, it is easiest to apply a dimensionality reduction using the PCA and then to retry with lower dimensionality, which is the topic of the upcoming Section 5.

4.2 Usage in Matlab

In Matlab we can evaluate our data with a linear classifier using the command fitcdiscr:

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld);

pc = 1-kfoldLoss(MdCv); % percent correct = 1-error

a formulation that is analogous to the use of the kNN classifier, compare with Section 3.1; see also the overview in Appendix F.1 again. Folding can be instructed explicitly as

Mdl = fitcdiscr(DAT, GrpLb);

MdCv = crossval(Mdl, ’kfold’,nFld);

pc = 1-kfoldLoss(MdCv);

which again is analogous to the kNN example. Appendix F.7 ensures that the reader has a working example. If one intends to fold the data oneself, then one applies the for-loop as exemplified in the kNN example; the generic folding loop is also illustrated in Section 6.1.

Classification errors/difficulties Should the data set be difficult to handle for a linear classifier, then we can try adding the option pseudolinear as follows:

Mdl = fitcdiscr(DAT, GrpLb, ’discrimType’, ’pseudolinear’);

This may return sub-optimal results, but at least the classification task can be solved.

4.3 Implementation (Matrix Decomposition)

In the following we point out how a linear classifier can be implemented, whereby the code here is taken from the old Matlab function classify. The code was slightly modified to make it compatible with our variable names (TREN, TEST, ...). There are two essential steps for training and one for testing (applying), enumerated as steps 1 to 3 below:


1. Learning: for each class the mean of its training samples is calculated and assigned to the matrix Gmeans:

Gmeans = NaN(nGrp, nDim);

for k = 1:nGrp

Gmeans(k,:) = mean(TREN(GrpTrn==k,:),1);

end

2. Learning: Now we estimate the covariance matrix using the orthogonal-triangular decomposition, which is carried out with the command qr: as argument we enter the training data minus the group means. The second line performs a scaling division:

[~,R] = qr(TREN - Gmeans(GrpTrn,:), 0);

R = R / sqrt(nObs - nGrp); % SigmaHat = R’*R

s = svd(R);

if any(s <= max(nObs,nDim) * eps(max(s)))

error(message(’stats:classify:BadLinearVar’));

end

logDetSigma = 2*sum(log(s)); % avoid over/underflow

Then another matrix decomposition follows, namely the singular value decomposition, see the command svd in the third line. After that, it is verified that the covariance matrix is adequate. We do not explain these decompositions further for reasons of brevity. The result is a matrix R and a scalar logDetSigma, which are the equivalent of the weight vector w.

3. Applying: When we classify testing samples, the function generates a 'confidence' value for each class, which is also called the posterior: it is placed into the matrix D here (number of testing samples × number of classes). The operation is essentially the dot product (eq. 2):

for k = 1:nGrp

A = bsxfun(@minus,TEST, Gmeans(k,:)) / R;

D(:,k) = log(prior(k)) - .5*(sum(A .* A, 2) + logDetSigma);

end

Some of this will be clarified when we look at the Naive Bayes classifier in Section 15, but for the moment we simply apply this procedure without understanding every detail.

With the posteriors in D we determine the class label by the argmax operation (eq. 5), see the example in Appendix F.7.

4.4 Recapitulation

A linear classifier is simple to apply, returns reasonable results and is fast in testing. The computation of the weights is often done with a covariance matrix (or some variant), whose space complexity is therefore O(d^2); for classification one only needs to perform a matrix product, whose space and time complexity is only O(d), which makes it a very popular classifier.

Advantages There is no need to set any parameters. The efficiency of a linear classifier is unparalleled. To obtain a better performance with a different classifier, the learning duration will increase substantially and it will require the adjustment of some parameters.

Disadvantages
- For some datasets it can be difficult to obtain a 'proper' covariance matrix. In that case simply apply the Principal Component Analysis first - coming up in the next Section - and then apply the linear classifier again. This little hurdle does not harm the advantages of the linear classifier - it is in fact a perfect symbiosis.


- For excessively high dimensionality, i.e. thousands of variables, the generation of the covariance matrix can become impractical due to its quadratic complexity O(d^2): a regular computer may not have sufficient RAM to calculate the matrix decompositions.

- In comparison to the kNN classifier, it is difficult to obtain reliable results for a small training set, for instance nSamples < 5 per class.


5 Dimensionality Reduction

Sometimes it is useful - if not even necessary - to reduce the number of dimensions (features, variables) of the data, because the data often contain variables that are irrelevant for classification and that are better eliminated. In the previous section for instance, we mentioned that the inverse of the covariance matrix sometimes cannot be computed - a statistic that is necessary for the generation of an efficient linear classifier: it is easier to compute that inverse after we have eliminated seemingly unnecessary features. There exist more powerful classifiers than the linear classifier that are able to ignore such irrelevant variables, but even for those classifiers it is almost always an advantage to eliminate the most irrelevant variables.

Dimensionality reduction can occur in two principally different ways, whereby here the term feature stands for dimension (and not for a feature vector representing a sample):

Feature Transformation/Generation is the transformation or combination of the original set of features to create a new (reduced) set of features. This means we compute intensively with the dataset. The most famous transformation is the principal component analysis (PCA), treated next (Section 5.1).

Feature Selection (Reduction) is the selection of the best subset of the (original) input feature set (Section 5.2). This is often done by observing the variables individually - as opposed to the transformation methods - and by applying a simple statistical or information-theoretic measure of the potential relevance of a variable.

5.1 Feature Transformation - PCA DHS p115, 568

Alp p113

ThKo p326
The most popular method for feature transformation is the principal component analysis (PCA), also called the Karhunen-Loeve transform. Here, the term component stands for feature (variable). The PCA works by realigning the coordinate system to the distribution of the data. wiki Principal component analysis
Example: Assume we have a 2D dataset whose overall distribution appears like the shape of an ellipse: the ellipse's larger diameter is rotated by 45 degrees, see Figure 8 left side (the point cloud is outlined by the ellipse shape). It is clear that this elliptical point cloud has two major 'directions', which are denoted as z1 and z2. The first one is the dominant one, the second one is aligned orthogonally to the first one. The PCA detects those principal axes and then places the axes of the original coordinate system - x1 and x2 - onto the new directions z1 and z2, illustrated on the right side of Figure 8.

In other words, the PCA determines the 'directions' of greatest variance in the data, then rotates the coordinate axes to those directions and moves the origin of the coordinate axes onto the data's center. There exist different procedures to perform the PCA. Here we sketch the one using the covariance matrix. It consists of five basic steps:

1. The mean and covariance of the data are determined. The mean results in a single vector µ (dimensionality d); the d × d covariance matrix Σ was introduced before (Section 4.1).

2. The eigenvectors e_i and eigenvalues λ_i are computed from the covariance matrix. For each dimension i there exists an eigenvector e and its corresponding eigenvalue λ. The eigenvectors represent the directions in the point distribution; the eigenvalues represent the 'significance' of each direction. We omit the details of how they are generated.

3. We now choose the k largest eigenvalues and their corresponding eigenvectors. There are different ways to choose k.

4. We build a d × k matrix A consisting of the k eigenvectors.

5. The original data x are multiplied by the matrix A in order to arrive at the reduced data x_r, which is then of dimensionality [number of samples × k]. This multiplication occurs as explained in Appendix D.

Algorithm 3 summarizes the individual steps. To make accurate estimates with our classifiers later, the optimal k is determined on the training set only and we thus apply the last step twice: once to the training set and once to the testing set.


Figure 8: Principal components analysis. Left: a set of points whose distribution happens to be elliptical and that can be represented by the two axes z1 and z2. The PCA procedure centers the samples and then rotates the coordinate axes to line up with the directions of highest variance. Right: z1 and z2 as the new axes. If the variance on z2 is too small, it can be ignored and we have a dimensionality reduction from two to one.

Algorithm 3 The steps of the PCA (for the method using the covariance matrix): performed on DL.
Parameters: k: number of principal components
Initialization: none in particular
Input: x_j(i): list of observations (DL), j = 1, .., nObservations, i = 1, .., d (nDimensions)
1) Compute: µ: d-dim mean vector; Σ: d × d covariance matrix
2) Compute eigenvectors e_i and eigenvalues λ_i (i is the dimension index)
3) Select the k largest eigenvalues and corresponding eigenvectors
4) Build the d × k matrix A with columns consisting of the k eigenvectors
5) Project the data x_j onto the k-dim subspace x′: x′ = F1(x) = A^t (x − µ)
Output: x′_j: list of transformed observations

5.1.1 Usage in Matlab

Matlab provides the command pca (older Matlab versions use princomp), which carries out steps 1 and 2 of Algorithm 3. Here is a complete example; the explanations are added below.

[coeff,~,lat] = pca(DAT); % the princ-comp analysis in Matlab

nPco = round(min(size(DAT))*0.7); % reduced dimensionality

PCO = coeff(:,1:nPco); % select the 1st nPco eigenvectors

% --- Reduce Data:

DATRed = zeros(nObs,nPco); % init reduced data matrix

for i = 1:nObs,

DATRed(i,:) = DAT(i,:) * PCO; % transform each sample

end

DAT = DATRed; % replace ’old’ dataset with new, reduced dataset

clear DATRed; % clear to save on memory

- The variable DAT is the original data of size n × d [nObs = n, the number of samples/observations]. The command pca returns a d × d matrix called coeff as well as a vector of latencies lat; we ignore the second output argument for the moment. The matrix coeff corresponds to the eigenvectors, the variable lat corresponds to the eigenvalues - if the eigendecomposition was used (see step 2 in Algorithm 3). The default for the command pca is however a singular value decomposition, another matrix decomposition method that we already encountered for the linear classifier.

- We then choose the number k (= nPco in the code), which can be done based on the values in lat. For simplicity, we choose k here based on the dimensionality, where the proportion value of 0.7 is only a suggestion, but one that should return reasonable results. Should there be fewer samples than dimensions - hopefully not - then k needs to be smaller than the number of observations minus one; that is why we use the minimum function on the size output. More on the choice of k will follow later.

- Then we create the submatrix PCO of dimensionality d × k, corresponding to matrix A in Algorithm 3.

- Finally, we multiply each sample (x = DAT(i,:)) by this submatrix and obtain the data DATRed with lower dimensionality (size n × k). That is the data you would then use for classification, e.g. with fitcdiscr.
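
In Python the whole procedure is wrapped in sklearn.decomposition.PCA; a sketch that also respects the training/testing separation mentioned for Algorithm 3 (nPco is the chosen number of components, as in the Matlab code above):

from sklearn.decomposition import PCA

pca = PCA(n_components=nPco).fit(TRN)   # steps 1-4, estimated on the training set only
TRNred = pca.transform(TRN)             # step 5 applied to the training set
TSTred = pca.transform(TST)             # step 5 applied to the testing set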

5.1.2 Choice of number of principal components (k)

There does not exist a single recipe for choosing an optimal k. Here are some suggestions:

- One dataset: if one classifies only a single dataset, one could manually inspect the 'variances' in the variable lat (latencies). Some prefer the elbow method, meaning one takes the value whose spatial distance to the straight line connecting the curve endpoints in the plot is largest. More explicitly, first calculate the line equation connecting the largest and lowest latency value; then measure the distance between each latency and that line; choose the index of the maximal distance as the optimal k.

- Several datasets: observe the lat values for some sets and try to derive a reasonable rule, e.g. take the first k components until 95 or 99 percent of the total variance is used up (a small sketch follows after this list). For that purpose simply normalize the values, LatNrm = lat/sum(lat);, whose sum should then equal one.

- Try a range of values and observe where the prediction accuracy is maximal. If the task is classification, then we choose k in dependence of the accuracy, which however should be the same for all folds in order to predict properly. If the task is clustering, then we choose k in dependence of the criterion for cluster validity. This will be further explained in the upcoming sections.

- Restriction: a meaningful number of principal components has to be less than the number of samples minus one, hence the operation min(size(DAT)) in the above code example. This is also mentioned in the Matlab documentation as the 'degrees of freedom'.
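
A sketch of the variance-based rule in Python, using the normalized explained variances that scikit-learn's PCA provides (the analogue of LatNrm above); fitting with all components is assumed:

import numpy as np
from sklearn.decomposition import PCA

LatNrm = PCA().fit(TRN).explained_variance_ratio_   # normalized 'latencies', summing to one
k = int(np.argmax(np.cumsum(LatNrm) >= 0.95)) + 1   # smallest k that covers 95% of the total variance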

5.1.3 Recapitulation

Advantages The PCA works essentially without parameters. We merely need to choose the desired number k of components.

Disadvantages It is possible that the PCA eliminates some dimensions that could have been useful for discriminating some classes. But that loss in discrimination is typically small and thus negligible in comparison to the ease of use of the procedure. In the end, the PCA often allows us to obtain a prediction accuracy with the linear classifier at all.

5.2 Feature Selection Alp p110

ThKo p261, ch5, pdf 274

Should the combination of PCA and linear classifier have failed, then the next step would be to evaluate the variables (dimensions) individually, using the class labels for help (called Grp previously). For each variable we determine with some statistical or other measure whether the values for one class actually differ from the values of another class. The easiest way would be to compare the mean values for each class, but that is somewhat too simplistic, because the values for one class can overlap substantially with the values of another class and thus have very similar mean values while still differing in shape. One therefore resorts to statistical tests, such as the t-test, or to information-theoretic measures, such as the Kullback-Leibler distance. Using such a measure, one ranks the variables and selects the most significant ones.


Many of these measures compare only two distributions, meaning two classes. Thus, if we have three or more classes, then we need to write a loop that tests each class versus all others. Furthermore, some of the measures assume that the variables show a normal (Gaussian) distribution, which is rarely the case. It is therefore beneficial to observe several measures and to choose the one that yields the highest performance.
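
As an illustration, a Python sketch that ranks the features of a two-class problem by the absolute t-statistic (scipy assumed; Bg1 is a boolean vector marking the samples of one class, as in the Matlab example below):

import numpy as np
from scipy.stats import ttest_ind

t, p = ttest_ind(DAT[Bg1], DAT[~Bg1], axis=0)   # two-sample t-test for every feature (column)
O = np.argsort(-np.abs(t))                      # feature indices, most discriminative first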

5.2.1 Usage in Matlab

The bioinformatics toolbox provides the function rankfeatures, which ranks the variables according to some selected criterion. The function works only for binary problems. Thus, for more than two classes we need to write a loop and take the average, for instance.

Two Classes If Bg1 is a binary vector that identifies one of the two classes, then one would call the function as follows:

[O V] = rankfeatures(DAT’, Bg1, ’criterion’, ’ttest’);

Note that the data matrix DAT is passed in flipped (transposed) format - unlike for so many other pattern recognition functions requiring the data matrix. O is the order of the feature indices in decreasing order of relevance; V holds the criterion values in the order corresponding to DAT. Appendix F.9 contains an example applying the function to artificial data.

Multiple Classes In this case, we test one class against all others - to create a binary evaluation - and we loop through all c classes (groups). We write a loop in which the binary vector (Bg1 above) identifies one class by 1s, for instance, and the remaining classes are all set to 0s. Then we apply the function rankfeatures again as above. We obtain the values V c times; we then average those values and rank the features again.

If the command rankfeatures is not available, then we could take the ROC value, for instance, as shown in Appendix F.11.1: it provides a function script that computes the so-called ROC value of two distributions.


6 Evaluating and Improving Classifiers

We now elaborate on how to characterize and optimize the performance of a classifier. One important issue was already introduced, namely the proper estimation of the prediction accuracy of our classifier model using the process of cross-validation; we elaborate on that issue in the following Section 6.1. Then we introduce performance measures that were developed in particular for binary classifiers (Section 6.2): we will learn that there is often a trade-off and that sometimes we wish to bias the decision in favor of one side. Then we observe some more tricks that can be applied to analyze multi-class tasks (Section 6.3). And we close with more tricks for analyzing and improving classifier performance (Section 6.4).

6.1 Types of Error Estimation DHS p465

When we estimate the classification error - or the percentage of correct classification - there are two types of measures to characterize that estimate: bias and variance. The two measures are analogous to the terms 'accuracy' and 'precision'. More specifically, they are defined as (see also Figure 9):

Bias measures the accuracy or quality of the match, that is the difference between the estimated and the actual accuracy - the latter we typically do not know. A high bias implies a poor match.

Variance measures the precision or specificity of the match. A high variance implies a weak match.


Figure 9: Individual accuracy estimates are shown as black dots on an axis. The actual accuracy value is indicated as 'reference' (typically we do not know that value); our overall estimate is indicated as a vertical bar labeled 'estimate'. Bias is the difference between the reference and the estimated value. Variance expresses how scattered the individual estimates are around the overall estimate. We would like both - variance and bias - to be small.

Bias and variance are affected by the type of resampling, see Table 1 for a summary of methods. So far we had used the method of cross-validation, in which the total dataset is partitioned (folded) into several equally-sized sets, e.g. five folds: four of those folds serve as training, whereas the remaining fold is used for testing. The folds are then rotated (shifted) to obtain five different prediction estimates, which are then averaged to obtain the overall estimate.


Table 1: Error estimation methods (from Jain et al. 2000). n: sample size, d: dimensionality. wiki Resampling (statistics)

Resubstitution
  Property: All the available data is used for training as well as testing; training set = test set.
  Comments: Optimistically biased estimate, especially when n/d is small.

Holdout
  Property: Half the data is used for training and the remaining data is used for testing; training and test sets are independent.
  Comments: Pessimistically biased estimate; different partitionings will give different estimates.

Leave-one-out, Jackknife
  Property: A classifier is designed using (n-1) samples and evaluated on the one remaining sample; this is repeated n times with different training sets of size (n-1).
  Comments: Estimate is unbiased but it has a large variance; large computational requirement because n different classifiers have to be designed.

Rotation, n-fold cross validation
  Property: A compromise between holdout and leave-one-out methods; divide the available samples into P disjoint subsets, 1 ≤ P ≤ n. Use (P-1) subsets for training and the remaining subset for testing.
  Comments: Estimate has lower bias than the holdout method and is cheaper to implement than the leave-one-out method. Ix = crossvalind('kfold',100,5);

Bootstrap
  Property: Generate many bootstrap sample sets of size n by sampling with replacement; several estimators of the error rate can be defined.
  Comments: Bootstrap estimates can have lower variance than the leave-one-out method; computationally more demanding; useful for small n.

Usage in Matlab With the command crossvalind (bioinfo toolbox) we can generate these folds. With Grp the grouping variable we could classify the data as follows:

nFld = 5;
Fld  = crossvalind('kfold', Grp, nFld);
for i = 1:nFld
    Btst    = Fld==i;      % logic vector indicating testing samples
    Btrn    = ~Btst;       % logic vector indicating training samples
    GrpTren = Grp(Btrn);   % grouping variable for training
    GrpTest = Grp(Btst);   % grouping variable for testing
    LbPrd   = classify(DAT(Btst,:), DAT(Btrn,:), GrpTren);   % now classify
    Bhit    = LbPrd==GrpTest;   % logic vector with hits
    ...further analysis...
end

We have given an example of how to use explicit folding in Appendix F.5 (the kNN example). Should the command crossvalind be missing, we can also write the function ourselves, see for instance Appendix F.10, or the sketch below.
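A minimal sketch of such a hand-written fold generator - not the code of Appendix F.10 - assigns each observation a fold index between 1 and nFld, stratified per class:

nSmp = numel(Grp);                  % number of observations
Fld  = zeros(nSmp,1);
uCls = unique(Grp);
for k = 1:numel(uCls)               % stratify: distribute fold labels per class
    Ix      = find(Grp==uCls(k));
    Ix      = Ix(randperm(numel(Ix)));            % shuffle within the class
    Fld(Ix) = mod(0:numel(Ix)-1, nFld)' + 1;      % fold labels 1..nFld
end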

6.1.1 Validation Set

There are situations when also a validation set is required: the validation set is a testing set inside the training set. For instance, for feature selection as introduced in Section 5.2 we require a validation set in order to determine when to stop the process of selecting features, otherwise the classification performance will eventually deteriorate. Similarly, when training a neural network, it is recommended to employ a validation set. The validation set is typically taken to be one of the training folds: thus in five-fold cross-validation, three folds would serve for training, one fold for validating and the remaining fold for testing.


6.2 Binary Classifiers

If a classifier distinguishes between two classes only - a binary classifier - then there is a specific set of measures to characterize its performance (Alp p489). Those measures stem from the domains of signal detection theory and information theory. To understand the logic of those measures, we use the example of a doctor inspecting an X-ray image and making a decision whether the patient requires treatment or not. To make that decision the doctor needs to detect some specific pattern (signal) - for instance, the bones show an unusual texture. But because analyzing X-ray images is difficult and because there exists no absolute certainty, a decision can be one of four possible response outcomes. How those four response outcomes occur may be best understood by looking at the graph in Figure 10.


Figure 10: Discrimination of two 'overlapping' classes: probability/frequency versus variable (or feature, dimension). The left distribution represents the signal; the right distribution represents the noise. A decision threshold θ is set and we obtain four response types: true positives (TP; hit), false positives (FP), false negatives (FN; miss), and true negatives (TN).

The graph depicts two overlapping density distributions: the one on the left represents the signal - the pattern that the doctor needs to detect; the one on the right represents the noise - or background or distracter - from which the doctor needs to discriminate the pattern. To make a decision, a threshold θ is applied and a side is chosen: if the left side of θ is chosen, one has considered the stimulus (or input) as the signal; if the right side is chosen, one has considered the stimulus as the distracter. For either choice, our prediction may be right or wrong. If the stimulus is considered the signal and it truly was the signal, then we talk of a 'hit' or 'true positive'; if it was not the signal, then we call it a 'false positive' or 'false alarm'. Analogously, if the stimulus is considered the distracter and it truly was the distracter, then we talk of a 'true negative' or 'correct rejection'; if it was not the distracter, then we label it a 'false negative' or 'miss'. Those four response types are summarized below:

     Label 1          Label 2            Part of distribution       Example
TP   true positive    hit                left of θ, under signal    Sick people correctly diagnosed as sick
TN   true negative    correct rejection  right of θ, under noise    Healthy people correctly identified as healthy
FP   false positive   false alarm        left of θ, under noise     Healthy people incorrectly identified as sick
FN   false negative   miss               right of θ, under signal   Sick people incorrectly identified as healthy

Clearly, wherever the decision threshold θ is set, it results in a trade-off. If the doctor wants to avoid unnecessary treatment - of persons incorrectly identified as sick - then he chooses (implicitly) a threshold more to the left; thereby the doctor would also produce false negatives, that is miss some sick people who are incorrectly classified as healthy. And vice versa, if the doctor attempts to ensure that all sick people are treated, then he chooses a threshold more to the right, but thereby also treats some healthy persons.

To quantify this trade-off there exist different measures. In a first step, the responses are arranged in a so-called confusion matrix (Section 6.2.1). From that matrix we use certain quantities with which we can calculate different trade-off measures, depending on your preferred viewpoint (Section 6.2.2). If we have the


possibility to influence the system performance by changing some parameters, then we can even measure an ROC curve (Section 6.2.3).

6.2.1 Confusion Matrix wiki Confusion matrix

In the confusion matrix we organize the four response outcomes as 'predicted' versus 'actual' and thus arrive at a 2 × 2 array. This response table is also known as contingency table or cross tabulation.

                          Actual
                     Matches   Non-matches
Predicted Matches       TP         FP         P'
          Non-matches   FN         TN         N'
                         P          N

The columns sum up to the actual number of positives (P) and negatives (N), while the rows sum up to the predicted number of positives (P') and negatives (N'):

P  = # positive actual instances     = TP + FN
N  = # negative actual instances     = FP + TN
P' = # positive classified instances = TP + FP
N' = # negative classified instances = FN + TN

Example: Assume you have made 100 observation decisions in your task, of which 22 times you predicted the presence of the signal and 78 times you predicted its absence, P' and N' respectively. Later you are informed that 18 of your 'presence' predictions were correct (= true positives), and 76 of your absence predictions were also correct (= true negatives). Then, you can calculate the frequency of the two remaining response types, namely FP and FN:

                          Actual
                     Matches     Non-matches
Predicted Matches    TP = 18      FP = 4        P' = 22
          Non-matches FN = 2      TN = 76       N' = 78
                      P = 20      N = 80        100 total

6.2.2 Measures and Response Manipulation wiki Precision and recall

Given the quantities as calculated above, we can then determine measures that quantify the classification performance more precisely than if we used only the hit rate and the above confusion matrix. Those quantities often come in pairs and are typical for certain domains, see Table 2. For three of those four domains, one quantity is the same, namely the hit-rate TP/P: it is called true-positive rate, recall and sensitivity in the respective domains.

For the above example, the true-positive rate (TP-rate or TPR) is 0.90; the false positive rate (FP-rate or FPR) is 0.05; the accuracy is 0.94. The sketch below reproduces those numbers.
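As a quick check, the following lines - with the counts of the example hard-coded - compute those measures directly from the four response counts:

TP = 18; FP = 4; FN = 2; TN = 76;
P  = TP + FN;  N = FP + TN;          % actual positives and negatives
TPR = TP / P                          % true-positive rate (recall): 0.90
FPR = FP / N                          % false-positive rate: 0.05
ACC = (TP + TN) / (P + N)             % accuracy: 0.94
PRC = TP / (TP + FP)                  % precision: ~0.82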

Response Manipulation There are cases where one may not be satisfied with the response rates of the system, because certain 'costs' may be associated with the response types. For instance, in the case of the doctor's decision, one may need to take into consideration the price of a treatment or the potential side effects of a treatment. Thus, one would like to bias the system in favor of a specific response outcome. Example: Google Maps uses an algorithm to blur faces and car license plates in their street view, in order to avoid lawsuits by private persons who were accidentally photographed during the street-view recording.


Table 2: Performance measures for a binary task. Some measures have multiple names; each domain (see right-most column) prefers different terminology and definitions. In Matlab: classperf.

Name             Formula                                    Other Names                      Preferred Use
Error            (FP + FN)/(P + N)
Accuracy         (TP + TN)/(P + N) = 1 - Error
True Pos Rate    TP / P                                     Hit-Rate, Recall, Sensitivity    ROC curve
False Pos Rate   FP / N                                     Fall-Out Rate                    ROC curve
Precision        TP / P'                                                                     Information Retrieval
Recall           TP / P                                     True Pos Rate, Sensitivity       Information Retrieval
Sensitivity      TP / P                                     True Pos Rate, Recall            Medicine
Specificity      TN / N = 1 - FP-rate                                                        Medicine
F1 score         2·(Precision·Recall)/(Precision+Recall)

How would you adjust the algorithm? Would you permit unblurred faces? Are false alarms costly? Answer: You probably want to detect all faces to ensure that no lawsuit is filed, that is you ideally want a perfect recall. Due to the trade-off this means that false alarms are going to be higher. False alarms are objects that appear similar to faces and car license plates. If such objects get blurred occasionally then this does not really impair the overall benefit of the street view, hence false alarms have a low cost here.

This search for an optimal decision is easier if one knows the relationship between a pair of such measures. This relationship is typically plotted as the so-called ROC curve, coming up next.

6.2.3 ROC Analysis wiki Receiver operating characteristic

The ROC analysis is a way of visualizing the response trade-off for a range of decision thresholds. In that analysis, the true-positive rate is plotted against the false-positive rate. Those two rates are calculated for different decision thresholds and if one connects the points one obtains the so-called receiver-operating characteristic (ROC) curve, see now Figure 11. It starts in the lower left corner and then increases, which would correspond to moving the decision threshold from left to right in Figure 10.

Figure 11: The ROC curve is generated by systematically manipulating the decision threshold and plotting the true positive rate against the false positive rate for each performance measurement. The curve lies typically above the diagonal. The diagonal represents random chance (gray dashed line). Ideally the curve would rise steeply - the stippled ROC is better than the solid ROC. Sometimes the area under the curve (AUC) is used as a quantity - the higher the value, the better the performance.


Assume that the black (solid) ROC curve reflects the performance of a loosely 'tuned' model. If one hopes to find a better model, then the curve should bend more toward the upper left, that is it should rise faster toward one, such as in the stippled ROC curve. If the curve becomes flatter and closer to the diagonal, then the model is worse. If the curve runs along the diagonal, then the decision making is pretty much random. If the curve runs below the diagonal, something is completely wrong - we may have swapped


the classes by mistake.

Using the ROC curve, the classification performance is sometimes specified as the area underneath it (AUC), thus we report a scalar value between 0.5 (chance) and 1.0 (perfect).
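In Matlab (stats toolbox), the function perfcurve generates the ROC points and the AUC from class labels and classifier scores. The following is a sketch assuming that Scr holds one score per test sample (e.g. a posterior probability for the positive class, here assumed to be labeled 1):

[FPR, TPR, ~, AUC] = perfcurve(GrpTest, Scr, 1);   % ROC points and area under the curve
plot(FPR, TPR); xlabel('False Positive Rate'); ylabel('True Positive Rate');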

6.3 Three or More Classes

For classifiers discriminating three or more classes, the above 'binary' characterization - the response table and its measures - is not directly applicable. Often, one reports only the percentage of correct classification or the error, that is the percentage of misclassification, as we did so far in our examples. Nevertheless, we can gain more insight about the classifier model by observing the confusion matrix, from which we can derive c binary evaluations.

Confusion Matrix for 2 or more Classes: The confusion matrix is of size c × c, where c is the number of classes. In that table, the axes for the actual and the predicted classes are often swapped - as opposed to the response table introduced in Section 6.2.1: the given (actual) classes are listed row-wise, the predicted responses are given column-wise. A classifier with good performance would return a confusion matrix where mostly the diagonal entries show high values, namely where actual and predicted class agree.

Example: Assume you have trained a classifier to distinguish between cats, dogs and rabbits. You test the classifier with 27 samples, 8 cats, 6 dogs, and 13 rabbits. You observe that your model makes the following confusions:

                   Predicted
               Cat   Dog   Rabbit
Actual Cat      5     3      0
       Dog      2     3      1
       Rabbit   0     2     11

In Matlab we can use the command confusionmat (stats toolbox), for which the first variable must be the actual group and the second variable must be the predicted group (grouphat):

CM = confusionmat(Grp.Tst, LbTst);

Or we can generate the confusion matrix as follows:

CM = accumarray([Grp.Tst LbTst],1,[nCat nCat]);

c Binary Analyses (One-Versus-All) Using the confusion matrix, we can now generate the binary measures - as introduced before - for each class c individually: for a selected class c, a discrimination between the selected class versus all other categories is carried out. For a category under investigation, with index i, we can obtain three values:

hit count TP: the ith diagonal entry.
false positives count FP (false alarms): the sum of values along the ith column minus the hit count TP.
false negatives count FN (misses): the sum of values along the ith row minus the hit count TP.

Note that false alarms and misses appear swapped in comparison to the 2 × 2 confusion matrix of Section 6.2.1.

In the example above, the hit count for the dog class is three; its false positive count is five (3+2); its false negative count is three (2+1).

In Matlab we can obtain false alarms and misses from the confusion matrix with the following lines (nCls = no. of classes):


CMpur = CM;                      % create a copy, which will be our 'pure' confusions
CMpur(diag(true(nCls,1))) = 0;   % set all diagonal entries to 0 (hits knocked out)
Cfsa = sum(CMpur,1);             % false alarms per category
Cmss = sum(CMpur,2);             % misses per category
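From the same quantities one can also compute a recall and a precision value per category - a small sketch, with Chit, Rcl and Prc as our own (hypothetical) variable names:

Chit = diag(CM)';                % hits per category (row vector)
Rcl  = Chit ./ (Chit + Cmss');   % per-class recall:    TP / (TP + FN)
Prc  = Chit ./ (Chit + Cfsa);    % per-class precision: TP / (TP + FP)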

6.4 More Tricks & Hints

We now list optimizations that one can perform in order to seek improvement of a classifier. The first one is a potential pitfall when we use classes with few instances (Section 6.4.1). The second one observes the classifier accuracy for different amounts of training (Section 6.4.2). The third optimization concerns enlarging the training set by manipulating or sub-selecting the training samples (Section 6.4.3).

6.4.1 Class Imbalance Problem ThKo p237

In practice there may exist datasets in which one class has many more samples than another class; or some classes may have only a few samples due to their rarity, for instance. This is usually referred to as the class imbalance problem. Such situations occur in a number of applications such as text classification, diagnosis of rare medical conditions, and detection of oil spills in satellite imaging. Class imbalance may not be a problem if the task is easy to learn, that is if classes are well separable, or if a large training dataset is available. If not, one may try to avoid possible harmful effects by 'rebalancing' the classes, either by over-sampling the small classes and/or by under-sampling the large classes.

6.4.2 Learning Curve

It is common to test the classifier for different amounts of learning samples (e.g. 5, 10, 15, 20 training samples), and to plot classification accuracy (and/or error) as a function of the number of training samples, a graph called the learning curve. An increase in sample size should typically lead to an increase in performance - at least initially; if performance only decreases then something is wrong. For some classifiers - neural networks typically - the classification accuracy may start to decrease for very large amounts of training due to a phenomenon called overtraining (overfitting). In that case one would like to stop learning when the accuracy starts to decrease. To achieve that, one employs a validation set (see above Section 6.1.1): the performance on the validation set would increase initially and then saturate with learning duration; it would start to decrease when overtraining starts to occur, and that is the point when training should be stopped.

6.4.3 Improvement with Hard Negative Mining and Artificial Samples

Now we mention two tricks that may help to improve classifier performance. These are tricks that are typically used with other classifiers, but can very well be tested with a linear classifier at little cost. Trying to tune a more complex classifier can be more difficult than trying out one of the following two tricks:

Hard Negative Mining: Here we focus on samples which trigger false alarms in our classifiers. More explicitly, if we train a classifier to categorize digits, then we observe and collect (mine) those digits that confuse one category, that is those samples that trigger false alarms. For instance, we collect those digit samples that confuse class '1', which could be '7's or '4's; or for category '3' it could be '2's or '8's. Then we train the classifier again with those 'hard negatives' in particular.

Creating 'Artificial' Samples: This is a trick popular in computer vision. Because the collection and labeling of image classes is tedious work, one sometimes expands the training set artificially by generating (automatically) more training images, which are slightly scaled and distorted variants of the original samples. This can be tried in combination with adding noise to the samples. Of course, artificial samples are created from the training set TRN only - not from the testing set.
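A minimal sketch of the noise variant - assuming TRN is the training data matrix, GrpTrn its grouping variable, and with the noise amplitude chosen arbitrarily:

nzAmp  = 0.05;                                   % noise amplitude (an assumption, to be tuned)
TRNart = TRN + nzAmp * randn(size(TRN));         % noisy copies of the training samples
TRNaug = [TRN; TRNart];                          % enlarged training set
GrpAug = [GrpTrn; GrpTrn];                       % labels are duplicated accordingly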


7 Clustering - K-Means    wiki K-means clustering; DHS p526; ThKo p741; Alp p145

We now introduce the first clustering procedure. Should one have forgotten the purpose of clustering, then revisit Section 1.1.

The most common clustering procedure is the K-Means method. It is an iterative procedure where the cluster centers - also called centroids - gradually move towards the actual cluster centers by repeated distance measurements (Fig 12). To initiate the procedure, one specifies how many clusters k one expects in the data; then one chooses k points randomly - from the total of n samples - and those are taken as initial centroids (b in Fig. 12). Then the procedure iterates through the following two principal steps:

1. Partitioning: Each data point is assigned to its nearest centroid (Fig. 12c). In other words, for each data point its 'membership' (or label) is determined. This assignment is done based on distance - loosely speaking it is the inverse of the kNN assignment. The assignment results in k partitions, which one can think of as temporary clusters.

2. Mean: With those partitions, the new centroids are calculated by taking the mean - hence the name K-Means (Fig. 12d). The new centroids will be in a slightly different location than the old centroids, namely a bit closer toward the actual center of the final cluster.

With the new centroids obtained in the second step, we then repeat step one, and then we recompute the new centroids in the second step again. By repeating these two steps, the procedure gradually moves the centroids towards the final centroids and their clusters. To terminate this cycle, we require a stopping criterion, e.g. we quit when the new centroids hardly move anymore, which means that the cluster development has 'settled'. Algorithm 4 summarizes the procedure.

Algorithm 4 K-Means clustering method. Centroid = cluster center.
Parameters      k: number of clusters
Input           x: list of vectors
Initialization  randomly select k samples as initial centroids
Repeat
  1. Generate a new partition by assigning each pattern to its closest centroid
  2. Compute new centroids from the labels obtained in the previous step
  (3. Optional) adjust the number of clusters by merging and splitting existing clusters, or by removing small or outlier clusters
Until stopping criterion fulfilled (e.g. new centroids hardly move anymore)
Output          L: list of labels (a cluster label for each xi)

There are two modes with which step 1 can be carried out, called batch update and on-line update.

Batch Update: Here, the re-assignment of a class label for a point to its closest centroid (step 1) is done at once for all points simultaneously. This is somewhat coarse but fast, much faster than the on-line update.

On-line Update: the re-assignment of class labels occurs for each point individually, which is more time consuming than the batch update, but also more accurate. If you have very large datasets, you may want to avoid the individual update and choose only the batch update.

The shortcoming of the K-Means procedure is that for complex data, it does not always generate the clusters 'properly'. There is no clustering method that universally does so, but the K-Means algorithm is a bit less robust in that respect than other methods. There are several reasons:

- Random initialization: the initial (random) choice of centroids can be crucial for successful clustering. Modern variants of the K-Means algorithm, such as K-Means++, have substantially improved this downside.


Figure 12: The principle of the K-Means method exemplified on a 2D example.
a. A dataset. We assume somehow that there are two clusters present (which is obvious in 2D to a human, but is not obvious to an algorithm), hence we set k = 2.
b. Initialization: we initiate the procedure by choosing two data points randomly, the filled circles, which are going to be our initial centroids.
c. Partitioning (1st step of loop): we determine to which centroid each data point is nearest and so generate two partitions: one partition is outlined by a gray line; the rest of the points are the other partition. This partitioning is one of two principal steps in the iterative procedure.
d. Calculation of Mean (2nd step of loop): we calculate the new centroids by averaging the data points of the clusters, new centroids marked as 'x'; this is the second principal step of the iterative procedure. The gray arrow shows the movement from old to new centroid.
We repeat steps one and two, c and d respectively, and that will gradually move the centroids to the center of the actual clusters. The procedure is stopped when the centroids hardly move anymore.


- By using a 'mean' calculation, there is a tendency to interpret the data as compact clusters; elongated clusters are not so well 'captured' by the procedure.

A simple solution to the downside of random initialization is to run the procedure repeatedly - always starting with different initial centroids - and then to select the partitioning for which the total sum of final distance values is smallest. This type of repetition is not explicitly stated in Algorithm 4, that is, we repeat the entire algorithm T times and choose that L (out of T L's) whose total distance is smallest. Although this repeated application is no guarantee that the actual centroids are found, it has been shown to be fairly reliable in practice.

The great advantage of the K-Means procedure is that it works relatively fast, because most samples are inspected only occasionally, namely for the number of iterations until the actual centroid is found. In contrast, the hierarchical clustering procedure, treated in the next Section 8, is much more exhaustive by observing each sample n − 1 times.

Clustering can be carried out with different types of distance measures, or it can be done with similarity measures, see also Appendix A. The Hamming distance and the Manhattan distance are better suited to deal with binary data or discrete-valued data.


7.1 Usage in Matlab

Matlab provides the function kmeans, for which one has to specify only the number of clusters k as minimal parameter input:

Lb = kmeans(DAT, 5);

which carries out one repetition by default. But you can also specify more repetitions using the parameter 'replicates', see the example in F.12. The variable Lb contains a cluster label between one and five and you then need to write a loop that identifies the indices for the individual clusters, see the function in Appendix F.12.1 for an example, or the short sketch below.
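Such a loop could look as follows - a sketch in which the cell array IxCls is our own (hypothetical) variable name:

nCls  = max(Lb);
IxCls = cell(nCls,1);
for i = 1:nCls
    IxCls{i} = find(Lb==i);    % indices of the samples belonging to cluster i
end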

The function kmeans first performs a coarse clustering using the batch update, followed by a refinement with an online update, see the nested functions called batchUpdate and onlineUpdate. You can turn off the online update as follows:

Lb = kmeans(DAT, 5, 'onlinephase', 'off');

which is recommended when your sample size is several tens of thousands or larger - otherwise your machine may be completely occupied by clustering. You can always add the online phase later if you think your PC can deal with the size of the data.

Distance The default distance measure that kmeans uses is the squared Euclidean distance, which yields the same partitions as the Euclidean distance. Because taking the square root is not really necessary for generating the partitions, the square-root function is omitted to save on computation.

Scaling By default the function kmeans uses the (squared) Euclidean measure, for which no scaling of the data takes place. Thus, it may be worth also trying clustering with scaling (Section 2.3). If you choose different distance measures - or in fact similarity measures - then the data are scaled accordingly, see Appendix A for how that is done.

Options There are a lot of options and parameters you can pass with the function statset. For instance you can set a limit for the maximum number of iterations and you can instruct the function to display in-between results:

Opt = statset('MaxIter',500, 'Display','iter');
Lb  = kmeans(DAT,5, 'replicates',3, 'onlinephase','off', 'options',Opt);

You then pass the structure Opt to the function kmeans. In the above example, we also instruct the kmeans function to perform three repetitions - called replicates here. The parameters passed as such are therefore suitable for the analysis of very large data - tens of thousands of samples.

NaN The Matlab function kmeans eliminates any NaN entries, that is, it throws out any rows (observations) where NaN occurs. The elimination is done with the function statremovenan. If you want to keep the observations with NaN entries, then you have to write your own function, see for instance: http://alpha.imag.pub.ro/~rasche/course/patrec/xxxx.

7.2 Implementation

Implementing a primitive version of the K-Means algorithm is not so difficult, because it is a fairly straightforward algorithm. An example is given in Appendix F.12; it does only the batch update for simplicity. A condensed sketch of the same idea follows below.
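The following is a minimal batch-update sketch (not the code of Appendix F.12): it uses pdist2 to assign each sample to its nearest centroid and stops when the centroids hardly move anymore. The choice of k, the maximum number of iterations and the tolerance are assumptions to be tuned, and empty clusters are not handled:

k   = 5;
Cen = DAT(randperm(size(DAT,1), k), :);         % initial centroids: k random samples
for it = 1:100                                  % maximum number of iterations
    [~, Lb] = min(pdist2(DAT, Cen), [], 2);     % 1. partition: nearest centroid per sample
    CenOld  = Cen;
    for j = 1:k                                 % 2. mean: recompute each centroid
        Cen(j,:) = mean(DAT(Lb==j, :), 1);      %    (empty clusters not handled in this sketch)
    end
    if max(abs(Cen(:)-CenOld(:))) < 1e-6, break; end   % stop: centroids hardly move
end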


7.3 Determining k - Cluster Validity    wiki Determining the number of clusters in a data set; ThKo p880, 16.4.1

If we are clueless about how many clusters to expect in our dataset, then we need to test a range of ks. We then need some measure or method that tells us which k could be optimal; such a method quantifies somehow how 'suitable' the obtained clusters are. This quantification is also called cluster validity. There are a number of methods we can try; here are two popular ones (see also the sketch after this list):

Decrease of Within-Cluster Scatter: This is probably the simplest method. For each cluster we determine its so-called within-cluster scatter, which is a measure that expresses how scattered (or spread) a cluster is, for instance by determining the mean distance from the centroid to all its cluster members. This measure is used in the Matlab function kmeans twice: once to determine when clustering is completed, namely by observing when it does not decrease anymore; and once for choosing the run with the lowest scatter - if you have chosen to run the K-Means clustering algorithm multiple times with the option 'replicates'. To make a decision which k could be optimal, one plots the scatter value against varying k, which will result in a decreasing curve, and then applies the elbow method - as it was also suggested for choosing k for the PCA (Section 5.1.2).

Silhouette Index: this method goes a step further than the previous measure and also determines how separated the clusters are from each other, that is it also measures the between-cluster scatter. To determine this measure, one can take for instance the distances between a centroid and all other (non-member) points in the dataset. Thus, this measure is computationally significantly more expensive than the previous simple one. Matlab provides the function silhouette to determine a so-called silhouette index for each cluster. One can make a selection based on the average of the k silhouette indices; or one can choose k where all indices are above zero.
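Both criteria can be sketched in a few lines; the range of tested ks and the use of the mean silhouette value over all points are our own choices here. kmeans returns the within-cluster sums of distances (sumd), and silhouette returns one value per observation:

kRng = 2:8;                              % candidate numbers of clusters (assumption)
scat = zeros(size(kRng));  sil = zeros(size(kRng));
for i = 1:numel(kRng)
    [Lb, ~, sumd] = kmeans(DAT, kRng(i), 'replicates',3);
    scat(i) = sum(sumd);                 % total within-cluster scatter (elbow method)
    sil(i)  = mean(silhouette(DAT, Lb)); % average silhouette value over all points
end
figure; plot(kRng, scat, 'o-');          % look for the 'elbow'
figure; plot(kRng, sil,  'o-');          % look for the maximum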

7.4 Recapitulation

The K-Means clustering algorithm is simple to use and delivers results quickly. It is therefore recommended that you always use it to analyze your data - even if you think your data deserve better analysis. The K-Means algorithm may give you only a 'coarse' result, which you can then attempt to improve with another clustering algorithm.

Complexity The overall complexity is fairly low, namely O(ndkT), where n is the sample size, d the dimensionality, k the number of clusters and T the number of replicates (DHS p527). Note that if you perform an excessive number of replicates on a small dataset, you may exceed the complexity of the hierarchical clustering algorithm (Section 8), in which case it may be more feasible to use that one.

Advantages K-Means works relatively fast due to its relatively low complexity. It is therefore suitable for very large datasets, unlike some other clustering methods. On the Matlab site 'File Exchange', a user provides an implementation that can deal with Gigabytes of data - it requires however a C compiler.

Disadvantages
- Specification of the number of clusters k: it is not obvious what a useful k would be - unless we know exactly how many clusters to expect. We have mentioned several methods to choose k automatically, but none of those procedures has proven to be effective for all patterns.
- It tends to favor hyper-ellipsoidal clusters - in particular if you use the within-cluster scatter for evaluation, e.g. for terminating or for validity.
- The algorithm does not guarantee optimal results due to the random initial selection.
- It can not handle categorical data well, due to the mean function.
- It is sensitive to outliers, because each observation is enforced to be part of a cluster.

How one can address some of the disadvantages is treated later in Section 17.


8 Clustering - Hierarchical    DHS p550; ThKo p653; HKP p456, 10.3

Hierarchical clustering is a more thorough clustering procedure than the K-Means method, because it analyzes the distances for each pair of observations. It would therefore be the preferred clustering method in principle; however, the exhaustive pair-wise distance measurements are computationally more intensive and the method is therefore not easy to apply to large datasets. Hierarchical clustering consists of three principal steps:

1. Pairwise distances: The pairwise distances between all points are determined, resulting in a distance matrix. This is a time-consuming calculation. In comparison, the K-Means algorithm computes only a proportion of these inter-point distances.

2. Linking to a hierarchy: Using the distance matrix, a nested hierarchy of all n data points is generated by gradually linking all the pairs, starting with the tightest pair. A hierarchy can be represented by a tree, which here is called a dendrogram. This step is far less costly than the first one (the pairwise distance calculation), as one compares only a list of distances.

3. Thresholding the hierarchy: We cut the tree horizontally at some distance and the resulting 'branches' form the clusters.

8.1 Pairwise Distances, Distance Matrix

One can think of the list of pairwise distance values as filling one half of the distance matrix: Fig. 13 shows how the lower half is filled. Sometimes one desires the entire (full) matrix, for instance for certain clustering or classification algorithms; then one would copy the values into the corresponding positions of the other half.

Figure 13: The distance matrix. The values for the pairwise distances of a list of n vectors (data matrix D for instance) can be thought of as filling the lower half of a so-called n × n distance matrix (or the upper half of a matrix - not shown). The diagonal holds zero values if distances are used - or maximal values if similarities are used.


The calculation of those pairwise distances is the costliest step of the hierarchical clustering procedures, in terms of both space and time. For n vectors (observations), one needs to compute n(n − 1)/2 values, roughly n²/2. For only 50'000 samples, those values already require ca. 4.7 Gigabytes of memory for holding all distance values in data-type single. For the full matrix, this doubles the memory requirements to 9.4 Gigabytes. Obviously, memory is one limitation of hierarchical clustering. However, it is not the actual limiting factor, because one can in principle place the distance matrix onto the hard-drive and then move sub-matrices between RAM and hard-drive. The more limiting factor is the temporal cost, because the duration to compute all the distance values grows quadratically with the number of samples (and linearly with the dimensionality).
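The memory figure is easily verified with a couple of lines (4 bytes per value in data-type single):

nSmp      = 50000;
memGB     = nSmp*(nSmp-1)/2 * 4 / 2^30     % lower half only: ca. 4.7 GB
memGBfull = nSmp^2 * 4 / 2^30              % full matrix: ca. 9.3 GB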


Usage in Matlab Pairwise distances can be computed with the command pdist:

Dis = pdist(DAT); % pair-wise distances [1 nSmp*(nSmp-1)/2]

DM = squareform(Dis); % distance matrix

If you deal with large matrices, it is useful to calculate the memory required for the output (Appendix F.4), otherwise the computation may occupy your PC's entire resources and you may need to restart your machine.

If the entire distance matrix is needed, then we can use the command squareform. Again, watch your memory.

For the linkage and cluster procedure - as treated next - the full matrix is however not required.

8.2 Linking (Agglomerative)

There exist two principal types of linking, namely agglomerative and divisive:

Agglomerative linking starts with the tightest pairs, the pairs with smallest distance, and then gradually links to more distal pairs; agglomerative linking is the more popular one and we therefore consider only that one.

Divisive linking works the opposite way: it starts by considering all pairs and then tries to gradually break down the links. Divisive linking is computationally so demanding that it is rare in practice.

Agglomerative linking is also known as Bottom-Up clustering or Clumping. It starts by considering every data point as an individual cluster; in other words, for n data points one starts with n clusters. Because a cluster with a single member is also called a singleton cluster, this linking procedure can be said to start with n singleton clusters. Then one forms the hierarchy by successively merging the data points into pairs whose distance is smallest, and one continues that type of linking until all points have been merged into a single cluster. The general algorithm is as follows, whereby the function g represents the link metric, a measure that takes the distance between a data point and a set of points.

Algorithm 5 Generalized Agglomerative Scheme (GAS). g: link metric (a type of distance measure between a point and a set). C: a cluster with members xi. R: set of clusters. From ThKo p654.
Parameters      cut threshold θ
Initialization  t = 0; choose R0 = {Ci = {xi}, i = 1, .., N} as the initial clustering
Repeat
  t = t + 1
  Among all possible pairs of clusters (Cr, Cs) in Rt−1 find the one, say (Ci, Cj), such that

      g(Ci, Cj) = min_{r,s} g(Cr, Cs)                                  (8)

  Define Cq = Ci ∪ Cj and produce the new clustering Rt = (Rt−1 − {Ci, Cj}) ∪ {Cq}
Until all vectors lie in a single cluster, resulting in the final set of clusters
Cut hierarchy at level θ

The link metric g can be implemented in various ways; Figure 14 shows some examples. The nearest-neighbor link metric is also referred to as single linkage; the farthest-neighbor link metric is also referred to as complete linkage. The choice of link metric biases the clustering process such that it favors certain shapes of clusters. We elaborate on those two types below, by looking at the toy example in Figure 15, top left.

Note that in equation 8 above, the minimum function min before the function g is not part of g itself.

Figure 14: Types of distance measurements between a single point (left, filled) and a set of points (right). In the context of clustering these are called link metrics.
a. Nearest Neighbor: First, the distances between the single point and the points in the set are determined, of which then the minimum is chosen.
b. Farthest Neighbor: as in a., but now the maximum of those distances is taken.
c. Average: first, for the set of points the average is computed (marked as 'x'); then the distance between the single point and the average point is taken.

Single-Linkage: this is also known as the Nearest-Neighbor linkage; it uses the minimum distance to compute the distances between samples and merging clusters, see Figure 14a. For that reason there appear only three unique distance values in the dendrogram of Figure 15, center left: the smallest


distance value (equal 0.15) reflects the minimum spacing between the points in the upper row - the five points at y-coordinate 0.8; the second smallest distance value (equal 0.2) reflects the minimum spacing between the four lower points. The third distance reflects the distance between the upper row of points and the set of lower four points. In this example, we intentionally cut the dendrogram between the second and third distance and so arrive at two final clusters. In general, the single-linkage method tends to generate chain-like clusters, which is evident by the selection of the upper row of points as one cluster (Figure 15, lower left).

Complete-Linkage (aka Farthest Neighbor): uses the maximum distance to compute the distances between samples and clusters, see Figure 14b. The corresponding dendrogram looks a bit less intuitive now (Figure 15, center right). There are more unique distance values due to the choice of the largest distance between two sets of merging clusters. The method tends to generate compact clusters, which in this example is not so obvious, except that it did not acknowledge the presence of a line of points as the single-linkage method did.

There exist of course many other types of linkage metrics. For instance, the average metric takes the mean distance value of the clusters under consideration, see Figure 14c.

Usage in Matlab Matlab provides the function linkage to organize the hierarchy. We specify with the second parameter the desired linkage metric, for instance:

Dis    = pdist(DAT);                 % pairwise distances
LnkSin = linkage(Dis, 'single');     % single linkage method
LnkCmp = linkage(Dis, 'complete');   % complete linkage method

The output variables LnkSin and LnkCmp are arrays containing the connections of the tree (N = number of observations). Each array is of size (N − 1) × 3: the first two columns contain the tree connections, pairs of samples or clusters that are closest; the third column contains the distances between those pairs of samples or clusters. The hierarchy can be displayed using the command dendrogram.
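For instance, to display and compare the two hierarchies (a short sketch):

subplot(1,2,1); dendrogram(LnkSin); title('Single Linkage');
subplot(1,2,2); dendrogram(LnkCmp); title('Complete Linkage');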



Figure 15: Top graph: A dataset consisting of 9 points. Center row: the dendrogram for the single linkage (left) and the complete linkage method (right). A distance threshold at value 0.201 is set (gray-dashed line). Bottom row: the clusters formed by the distance threshold.

8.3 Thresholding the Hierarchy - Cluster Validity ThKo p690, 13.6

The link tree can be thresholded in different ways. The simplest is to cut it at a specific distance value, thus resulting in some number of clusters. If we have a specific number of clusters k in mind, then we cut the tree at the corresponding level that results in exactly k clusters. Or we wish to apply a criterion, similar to what we did for K-Means clustering (Section 7.3). Here are two criteria that are popular with hierarchical trees:

- Inconsistency coefficient: expresses how inconsistent a link is in the tree - the larger its value, the more inconsistent it is. For a given link, the inconsistency value is calculated from the height values of all those links that are on the same level.

- Cophenetic (correlation) coefficient: a single value that describes how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. It has been used a lot in


biostatistics. The closer the value to one, the better the ’preservation’.

Usage in Matlab Matlab provides the functions inconsistent and cophenet to generate those measures, where Lnk is the output as generated by the command linkage (see the Section above) and Dis are the pairwise distances:

Inc = inconsistent(Lnk); % [nObs-1 4]

coph = cophenet(Lnk, Dis); % single value

Matlab offers the function cluster to cut the link hierarchy in those various ways. If you specify inconsistent as the criterion, then it will automatically call the function script inconsistent.

Lbl = cluster(Lnk, 'maxclust', 3);                         % three or fewer clusters
Lbl = cluster(Lnk, 'cutoff', 2.01, 'criterion', 'distance');
Lbl = cluster(Lnk, 'cutoff', 1.25, 'criterion', 'inconsistent');

Matlab also offers the function script clusterdata, which performs the three steps all together: pairwise distances, linking and thresholding the hierarchy.

8.4 Recapitulation

Application Hierarchical clustering is suitable in particular when hierarchies need to be described, e.g. biological taxonomy.

Advantages
- Hierarchical clustering offers easy interpretation of the results - as does the Decision Tree classifier.
- In general, the exhaustive analysis permits finding cluster centers more reliably than the K-Means method (Section 7).

Disadvantages
- Complexity: the exhaustive (pair-wise) comparison of all samples (observations) implies quadratic complexity: O(N²). Hierarchical clustering is thus suitable for databases of limited size only. There exist various methods to reduce the complexity to some extent, but those can also be applied to many other clustering methods, which have a smaller complexity by nature.
- Linkage determines shape: the choice of linkage method determines what type of cluster shape we will obtain. That means that the linkage method imposes a certain structure on the data.
- Lack of a method or algorithm to determine an optimal cluster number: it is difficult to determine an optimal threshold (cutoff) level automatically - as it is difficult to determine an optimal k in the K-Means procedure. This is a general challenge of any clustering algorithm and therefore not a disadvantage specific to this type of clustering method.

Usage in Matlab The three steps can be carried out as follows:

Dis = pdist(DAT);                % pairwise distances
Lnk = linkage(Dis, 'single');    % single linkage method
Lbl = cluster(Lnk, 'cutoff', 1.25);

Or use the function clusterdata, i.e.

Lbl = clusterdata(DAT, 'maxclust', 3);


9 Decision Tree    ThKo p215, 4.20, pdf 228; Alp p185, ch 9; DHS p395

A decision tree is a multistage decision system, in which classes are sequentially rejected until we reach a finally accepted class. The decision process corresponds to the flow diagrams we learned in school.

A decision tree is particularly useful if data is nominal, see again Section 1.3.1. Nominal data are discrete and without any natural notion of similarity or even ordering. One often uses lists of attributes to express objects. A common approach is to specify the values of a fixed number of properties by a property d-tuple. For example, consider describing a piece of fruit by the four properties color, texture, taste and size. Then a particular piece of fruit might be described by the 4-tuple {red, shiny, sweet, small}, which is a shorthand for color = red, texture = shiny, taste = sweet and size = small. Such data can be classified with decision trees.

Figure 16 shows an example for a 2D dataset. On the left an (artificial) dataset is shown, which consists of 6 regions whose points belong to four different classes; classes 1 and 3 have two separate regions each. The vertical and horizontal lines represent decision boundaries that separate the classes. For example, the decision boundary at x1 = 1/4 separates part of class ω1 from all others. In order to assign a testing sample {x1, x2} to one of the four classes, one applies a number of decision boundaries - the decision tree is an efficient way to train and apply these decisions. The illustrated example may appear very simple, but it is in fact very difficult to solve for other classifiers, as in this example the data points of some classes lie in very different regions of the entire 2D space.

Figure 16: Left: a pattern divided into rectangular subspaces by a decision tree. Right: corresponding tree. Circles: decision nodes. Squares: leaf/terminal nodes. [Source: Theodoridis, Koutroumbas, 2008, Fig 4.27, 4.28]

On the right side of Figure 16 a decision tree is shown that separates all instances and assigns them to their corresponding classes. A decision tree is drawn from top to bottom and consists of three types of nodes that are connected by links or branches:

root node: the top node of the tree; it is a decision node, labeled t0 in this case.

decision node: tests a component of the multi-dimensional feature vector; drawn here as circles and labeled ti, i = 0, .., ndecisions.

leaf (or terminal) node: assigns the instance to a class label; drawn here as squares and labeled ωi, i = 1, .., nclasses.


The example decision tree consists of 5 decision nodes and 6 leaf nodes. Given a (testing) data point, e.g. x1 = 0.15, x2 = 0.5, the decision node t0 tests the first component, x1, by applying a threshold value of 1/4: if the value is below, the data point is assigned to class ω1; if not, the process continues with binary decisions of the general form xi > α (α = threshold value) until it has found a likely class label.

The example of Figure 16 is a binary decision tree and splits the space into rectangles with sides parallel to the axes; for higher dimensionality (3D or more) those would be called hyper-rectangles. Other types of trees are also possible, which split the space into convex polyhedral cells or into pieces of spheres. Note that it is possible to reach a decision without having tested all available feature components.

In practice, we often have data of higher dimensionality and we therefore need to develop binary decision trees automatically, that is, we need to learn somehow when which component xi is tested with what threshold value αi. The learning rule selects threshold values where the decision achieves higher class 'purity', meaning individual class frequencies should either increase or decrease with every split. At the beginning, the training set X is considered 'impure' because all class labels are present. With every following decision - and hence the splitting of the training set - the resulting split in class labels is supposed to be purer. The training procedure has therefore three key issues: impurity, stop splitting and the class assignment rule. Those issues are elaborated now:

1: Impurity

Every binary split of a node, t, generates two descendant nodes, denoted as tY and tN according to the 'Yes' or 'No' decision; node t is also referred to as the ancestor node (when viewing such a split). The dataset arriving at the ancestor node is split into subsets XtY, XtN, which in turn are fed to the descendant nodes, see Figure 17; the root node is associated with the entire training set X.

Figure 17: The (learning) datasets associated with the split at a decision node. The set of training vectors Xt arrives at the ancestor node and is split into the sets XtN and XtY, which are fed to the descendant nodes. For each node a (class) impurity is calculated, I(t) at the ancestor node, I(tY) and I(tN) at the respective descendant nodes. The learning rule seeks to decrease the impurity at splits.


Now the crucial point: every split must generate subsets that are more 'class homogeneous' compared to the ancestor's subset Xt. This means that the training feature vectors in XtN and the ones in XtY show a higher preference for specific class(es), whereas the data in Xt are more equally distributed among the classes. Example: for a 4-class task, assume that the vectors in subset Xt are distributed among the classes with equal probability (percentage). If one splits the node so that the points that belong to classes ω1 and ω2 form subset XtY, and the points from ω3 and ω4 form subset XtN, then the new subsets are more homogeneous compared to Xt, or 'purer' in the decision tree terminology.

The goal, therefore, is to define a measure that quantifies node impurity and to split the node so that the overall impurity of the descendant nodes is optimally decreased with respect to the ancestor node's impurity. Let P(ωi|t) denote the probability that a vector in the subset Xt, associated with a node t, belongs to class ωi, i = 1, 2, ..., M. A commonly used definition of node impurity, denoted as I(t), is the entropy of the subset Xt:

I(t) = − Σ_{i=1}^{M} P(ωi|t) log2 P(ωi|t)                                  (9)

where log2 is the logarithm with base 2 (see Shannon’s Information Theory for more details). We have:


- Maximum impurity I(t) (equal to log2 M) if all probabilities are equal to 1/M (highest impurity).
- Least impurity I(t) = 0 if all data belong to a single class, that is, if only one of the P(ωi|t) = 1 and all the others are zero (recall that 0 log 0 = 0).

The impurity decrease of a split is the ancestor's impurity minus the impurities of the descendants, each weighted by the fraction of samples it receives: ∆I(t) = I(t) − (NtY/Nt) I(tY) − (NtN/Nt) I(tN), where Nt, NtY and NtN are the numbers of vectors in Xt, XtY and XtN respectively. When determining the threshold α at node t, we attempt to choose a value such that ∆I(t) is large.

Example: given is a 3-class discrimination task and a set Xt associated with node t containing Nt = 10 vectors: 4 of these belong to class ω1, 4 to class ω2, and 2 to class ω3. Node splitting results in: subset XtY with 3 vectors from ω1 and 1 from ω2; and subset XtN with 1 vector from ω1, 3 from ω2, and 2 from ω3. The goal is to compute the decrease in node impurity after splitting. We have that:

I(t)  = − (4/10) log2(4/10) − (4/10) log2(4/10) − (2/10) log2(2/10) = 1.521

I(tY) = − (3/4) log2(3/4) − (1/4) log2(1/4) = 0.811

I(tN) = − (1/6) log2(1/6) − (3/6) log2(3/6) − (2/6) log2(2/6) = 1.459

Hence, the decrease in impurity at this split is

∆I(t) = 1.521 − (4/10)(0.811) − (6/10)(1.459) = 0.321
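These numbers can be verified with a few lines; the anonymous function impurity below and the count vectors are our own helper names, not code from the workbook's appendices:

% entropy-based impurity from class counts; the (c==0) term makes 0*log2(0) evaluate to 0
impurity = @(c) -sum((c/sum(c)) .* log2(c/sum(c) + (c==0)));
Nt  = [4 4 2];    % class counts at the ancestor node t
NtY = [3 1 0];    % class counts at descendant node tY
NtN = [1 3 2];    % class counts at descendant node tN
It  = impurity(Nt);    % ~1.521
ItY = impurity(NtY);   % ~0.811
ItN = impurity(NtN);   % ~1.459
dI  = It - sum(NtY)/sum(Nt)*ItY - sum(NtN)/sum(Nt)*ItN   % ~0.321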

2: Stop Splitting

The natural question that now arises is when one decides to stop splitting a node and declares it as a leaf of the tree. A possibility is to adopt a threshold T and stop splitting if the maximum value of ∆I(t), over all possible splits, is less than T. Other alternatives are to stop splitting either if the cardinality of the subset Xt is small enough or if Xt is pure, in the sense that all points in it belong to a single class.

3: Class Assignment Rule

Once a node is declared to be a leaf, then it has to be given a class label. A commonly used rule is the majority rule, that is, the leaf is labeled as ωj where

j = \arg\max_i P(\omega_i|t)

In words, we assign a leaf, t, to that class to which the majority of the vectors in Xt belong.

Learning A critical factor in designing a decision tree is its size. The size of a tree must be large enough but not too large; otherwise it tends to learn the particular details of the training set and exhibits poor generalization performance. Experience has shown that using a threshold value for the impurity decrease as the stop-splitting rule does not lead to trees of the right size: many times it stops tree growing either too early or too late. The most commonly used approach is to grow a tree up to a large size first and then prune nodes according to a pruning criterion. A number of pruning criteria have been suggested in the literature. A commonly used criterion is to combine an estimate of the error probability with a complexity-measuring term, e.g., the number of terminal nodes.

It is not uncommon for a small change in the training dataset to result in a very different tree, meaning there is a high variance associated with tree induction. The reason for this lies in the hierarchical nature of the tree classifiers. An error that occurs in a higher node propagates through the entire subtree, that is, all the way down to the leaves below it. The variance can be reduced by using so-called random forests, which we will introduce in the section on ensemble classifiers (Section 10.3).


Algorithm 6 Growing a binary decision tree. From ThKo p219.
Parameters: stop-splitting threshold T
Initialization: begin with the root node, Xt = X
For each new node t
  For every feature xk (k = 1, ..., l)
    For every value αkn (n = 1, ..., Ntk)
      - Generate XtY and XtN for: xk(i) ≤ αkn, i = 1, ..., Nt
      - Compute ∆I(t|αkn)
    End
    αkn0 = argmax_n ∆I(t|αkn)
  End
  [αk0n0, xk0] = argmax_k ∆I(t|αkn0)
  If the stop-splitting rule is met
    declare node t as a leaf and designate it with a class label
  Else
    generate nodes tY and tN with corresponding XtY, XtN for: xk0 ≤ αk0n0
  End
End

9.1 Usage in Matlab

In Matlab we use the function fitctree to evaluate data with a tree classifier (formerly available as classregtree):

MdCv = fitctree(DAT, GrpLb, ’kfold’,nFld);

pcTree = 1-kfoldLoss(MdCv);

We can visualize a tree using the function view:

view(MdCv.Trained{1},'Mode','graph');

Appendix F.14 gives a complete example; the usage of fitting and evaluation functions is analogous to the use of the kNN-classifier function or the linear-classifier function, see again the overview of classifiers in Appendix F.1 or see the explicit code for the kNN classifier in Appendix F.5.

9.2 Recapitulation

Application Decision tree classifiers are particularly useful when the input is non-metric, that is, when we have categorical variables. They also treat mixtures of numeric and categorical variables well.

Advantages Learning duration is short, and due to their structural simplicity, decision trees are easily interpretable. Even for Random Forests the learning duration is relatively short.

Disadvantages Learning is not robust: slight changes in the dataset can lead to the growth of very different trees. However, with Random Forests - introduced in the next section - those 'variances' are averaged out.


10 Ensemble Classifiers Alp p419, ch 17

HKP p377, 8.6

An ensemble classifier is a classifier that combines the classification estimates of multiple, individual classifiers; one could say the ensemble classifier collects advice from many experts and then arrives at its own decision. The previously introduced classifiers - the kNN, the Linear Classifier and the Decision Tree - attempt to obtain a high classification accuracy by training a single classifier until it has reached 'perfection'. In contrast, ensemble classifiers use many less-than-perfect classifiers, each one representing only a mediocre discrimination function; these base classifiers - or base learners - are then combined to form a single (total) decision. There are two principal motivations for evaluation with an ensemble classifier:

1. We have measurements from separate sources, e.g. a visual signal and an audio signal, each signal with its own set of features. Then one trains a classifier for each source separately and combines their decisions - in this case also known as data fusion. Section 10.1 introduces the basics of combining classifiers this way.

2. We may try to solve the classification problem with a set of classifiers, whereby an individual classifier performs merely above chance level. By combining these 'opinions' we may obtain an expert advice, which is hopefully better than the advice of a single classifier. An example is given in Section 10.2.

Labels or Graded Measures When we combine the base learners, we can work with either the category labels as returned so far by predict, or we can work with the graded measure that was calculated before the category label was chosen by predict. If we choose to work with labels only, then we essentially look at a code of binary values, and the most successful method in that direction is the so-called method of error-correcting output codes (ECOC), see Section 10.7 for more. This is also the method used when we intend to build an ensemble classifier that tries to solve a multi-class problem by training K one-versus-all classifiers, which is what one does when one applies a Support-Vector Machine to multi-class problems (Section 13).

If you intend to work with a graded measure, traditionally called the posterior probability, then note that it goes by various names: in Matlab 'scores', which express a confidence; in Python 'probabilities'. We obtain that measure in Matlab as the second output argument of the function predict

and in Python by the function predict_proba:

[~,Score] = predict(Mdl, TST); % called 'score' in Matlab, size [nSmp nCls]

Prob = Mdl.predict_proba(TST) # called 'probabilities' in Python, size [nSmp nCls]

If we combine classifiers with mixed outputs, labels and graded measures, then we need to make them compatible somehow. Section 10.4 gives some ideas.

Base-Learner Search To pursue the search for an ensemble classifier along the second motivation, there exist two principal techniques. One is to choose different subsets of the training samples and to train individual classifiers on those subsets; this is explained under Bagging in Section 10.2. Another way is to start with suboptimal classifiers on the entire training set and then to sub-select those samples that were misclassified in the first round; this is introduced in Section 10.5.

10.1 Voting

The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners

y_i = \sum_j w_j d_{ji}, \qquad w_j \ge 0, \quad \sum_j w_j = 1. \qquad (10)

This is also known as ensembles and linear opinion pools. In the simplest case, all learners are given equal weight (wj = 1/L), which is also called "simple voting": it corresponds to taking an average. But one could also simply take the sum, which is probably the most widely used rule in practice. Other combination rules are:


Median    y_i = median_j d_{ji}     robust to outliers
Minimum   y_i = min_j d_{ji}        pessimistic
Maximum   y_i = max_j d_{ji}        optimistic
Product   y_i = \prod_j d_{ji}      veto power

The median rule is more robust to outliers. The minimum and maximum rules are pessimistic and optimistic, respectively. With the product rule, each learner has veto power: regardless of the other ones, if one learner has an output of 0, the overall output is set to 0. Note that after applying these combination rules, the yi do not necessarily sum up to 1.

If the outputs dji are not posterior probabilities, these rules require that the outputs be scaled to the same range. Section 10.4 gives examples.

An all-in-one function was shown in the overview in Appendix F.1; Appendix F.15 gives an explicit code example.

If the dataset consists of features obtained from different sources, then one should definitely try an ensemble classifier with a voting scheme, as it does not involve any particular tuning, that is, it takes little effort to test this variant. For instance, suppose we have data with audio and visual features: we train one LDA solely on the visual features and obtain the corresponding posterior values (3rd output argument, see Section 4.2), and we train another LDA solely on the audio features and obtain the corresponding posterior values. We then combine the two sets of posteriors with whichever rule gives us the maximum performance.
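A minimal sketch of this fusion in Matlab, assuming the training/testing matrices of the two sources (VisTrn, AudTrn, VisTst, AudTst) and the training label vector GrpTrn already exist:

MdlV = fitcdiscr(VisTrn, GrpTrn);      % LDA trained on the visual features only
MdlA = fitcdiscr(AudTrn, GrpTrn);      % LDA trained on the audio features only
[~,PostV] = predict(MdlV, VisTst);     % posteriors for the test set, [nSmp nCls]
[~,PostA] = predict(MdlA, AudTst);
PostSum = PostV + PostA;               % sum rule; try also max(PostV,PostA) or PostV.*PostA
[~,LbPred] = max(PostSum, [], 2);      % fused class decision per test sample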

10.2 Bagging

Bagging is a voting method whereby base learners hj are made intentionally different by training them on different subsets of the training set. Bagging can reduce variance and thus reduce the generalization error.

The subsets are generated by bootstrap, that is, by randomly drawing a subset of samples from the training set with replacement (hence the name bagging = bootstrap aggregation), see again Table 1 in Section 6.1. Given a training set X, we create B variants, X1, X2, ..., XB, by uniformly sampling from X with replacement. (Because sampling is done with replacement, it is possible that some instances are drawn more than once and that certain instances are not drawn at all.) One can use randsample to create different subsets of X, e.g.

for i = 1:nSub

Ixr = randsample(nTrnSamp, nSubSize, true); % bootstrap: draw nSubSize indices with replacement

DATsub = DAT(Ixr,:); % select the corresponding rows of DAT (and of the label vector)

...train a classifier on DATsub...

end

For each of the training set variants, Xi, a classifier hi is constructed. The final decision is in favor of the class predicted by the majority of the subclassifiers, hi, i = 1, 2, ..., B.
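A sketch of the majority decision in Matlab, assuming the numeric label predictions of the B subclassifiers have been collected column-wise in a matrix Pred of size [nTst B], and GrpTst holds the true test labels:

LbBag = mode(Pred, 2);            % majority vote across the B base learners
pcBag = mean(LbBag == GrpTst);    % resulting prediction accuracy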

By randomly selecting a subset, the individual base learners will be slightly different (remember motivation no. 2 above). To increase diversity, bagging works better if the base learner is trained with an unstable algorithm, such as a decision tree, a single or multilayer perceptron, or a condensed NN. Unstable means that small changes in the training set cause a large difference in the generated learner, namely a high performance variance.

Bagging as such is a method worth trying, as it also involves few complications. Bagging is successfully used in some applications; for instance, the motion recognition software of Microsoft's Kinect uses Random Forests for classification (Section 10.3).


10.3 Random Forest wiki Random forest JWHT p316, 8.2

HKP p382, 8.6.4

A random forest is an ensemble classifier that employs the technique of bagging as introduced before. For each randomly selected subset of the entire training set, a decision tree is trained. Such a tree represents the base learner, also called weak learner in this context. A random forest therefore consists of multiple decision trees, hence its name. To classify a new sample, the sample is applied to each individually trained decision tree and the outputs are then combined to arrive at a pooled, single decision.

The choice of the number of decision trees is a matter of heuristics, very much like the choice of k for the kNN classifier. There are two notable cases where random forests are particularly successful:
- document classification, as part of the field of text mining (text categorization);
- movement classification, as part of the domain of computer vision. Microsoft's Kinect device recognizes a user's movements by use of such random forests.

Usage in Matlab We use the command TreeBagger to learn the ensemble of trees, and then employ the command predict to classify the testing samples:

Forest = TreeBagger(100, TREN, Grp.Trn);

PredC = predict(Forest, TEST);

The output of the function predict is a cell array of label strings; to compare with numeric labels we need to convert them, using str2double for instance. A full example is given in Appendix F.16.
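A sketch of that conversion and of the accuracy computation, assuming the numeric test labels are stored in GrpTst:

PredC = predict(Forest, TEST);      % cell array of label strings, e.g. {'1';'3';'2';...}
PredN = str2double(PredC);          % convert back to numeric labels
pcForest = mean(PredN == GrpTst);   % fraction of correctly classified test samples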

10.4 Component Classifiers without Discriminant Functions DHS p498, 9.7.2, pdf 576

If we create an ensemble classifier whose base learners consist of different classifier types, e.g. one is an LDA and the other is a kNN classifier, then we need to adjust their outputs, in particular if they do not compute discriminant functions. In order to integrate the information from the different (component) classifiers we must convert their outputs into discriminant values. It is convenient to convert the classifier output g̃i to a range between 0 and 1, now gi, in order to match it to the posterior values of a (regular) discriminant classifier. The simplest heuristics to this end are the following:

Analog (e.g. NN): softmax transformation:

g_i = \frac{e^{\tilde{g}_i}}{\sum_{j=1}^{c} e^{\tilde{g}_j}}. \qquad (11)

Rank order (e.g. kNN): If the output is a rank order list, we assume the discriminant function is linearly proportional to the rank order of the item on the list. The values for gi should then sum to 1, that is, scaling is required.

One-of-c (e.g. Decision Tree): If the output is a one-of-c representation, in which a single category is identified, we let gj = 1 for the j corresponding to the chosen category, and 0 otherwise.

The following table gives a simple illustration of these heuristics (example taken from Duda/Hart/Stork).

Other scaling schemes are certainly possible too. As pointed out, one needs to look into the prediction functions given by the software packages to understand what graded measures can be obtained.
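A compact Matlab sketch of these three conversions; Gtil is assumed to hold the raw outputs g̃i for the c classes, Rnk a rank-order list (best class first), and jWin the single category chosen by a tree. The linear rank weighting is just one possible choice:

% softmax for analog outputs
Gsoft = exp(Gtil) ./ sum(exp(Gtil));

% rank order: linearly decreasing weights, scaled to sum to 1
c = numel(Rnk);
Grank = zeros(1,c);
Grank(Rnk) = (c:-1:1) / sum(1:c);   % best-ranked class receives the largest weight

% one-of-c for a decision tree output
Gone = zeros(1,c);
Gone(jWin) = 1;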


10.5 Boosting

Boosting is a method that focuses on samples that are difficult to discriminate, namely those samples that tend to be unusual for their own class. The boosting procedure starts by training on the entire set in a superficial way and then analyzes which samples were not properly classified. Those misclassified examples are then selected and a second round of training is carried out. We have thus accumulated two base learners so far, one for the first round and one for the second round. One can continue this cycle of selecting misclassifications and training separately until all training samples are correctly classified, and we will have accumulated a sequence of base learners. When we evaluate a new sample, we apply that sequence and thereby generate a graded measure.
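In Matlab, boosting is available through the ensemble fitting function; a hedged sketch in the style of the earlier examples (the parameter values are illustrative only):

MdCv = fitcensemble(DAT, GrpLb, 'Method','AdaBoostM1', ...   % 'AdaBoostM1' is for two classes; use 'AdaBoostM2' for multi-class
                    'NumLearningCycles',100, 'kfold',5);
pcBoost = 1 - kfoldLoss(MdCv);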

10.6 Learning the Combination

Instead of choosing a combination rule (see the table in Section 10.1), we may try to optimize the combination stage by training a classifier on the discriminant values being combined. For instance, we train an 'optimization' classifier to combine the discriminant values of an LDA and a kNN classifier. That optimization classifier would take a 2 × K matrix as input - two rows because we have the LDA and the kNN classifier (K = number of classes) - and it would return a vector of length K as the final posterior. There are also other ways to combine component classifiers.

To obtain a proper estimate of the prediction accuracy, we need to train the base classifiers and the combination stage separately. That means we need to split the training set into a subset for training the base classifiers only, and a validation subset for the combination stage, see also Section 6.1.1. Ultimately, this learning scheme is more elaborate and requires more training data, but sometimes we gain another one or even more percent in prediction accuracy.
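A minimal stacking sketch in Matlab, assuming TRN/GrpTrn train the base learners, VAL/GrpVal is the held-out validation subset for the combination stage, and TST is the test set:

MdlLda = fitcdiscr(TRN, GrpTrn);                   % base learner 1: LDA
MdlKnn = fitcknn(TRN, GrpTrn, 'NumNeighbors',5);   % base learner 2: kNN
[~,PostLdaV] = predict(MdlLda, VAL);               % graded measures on the validation set
[~,PostKnnV] = predict(MdlKnn, VAL);
MdlComb = fitcdiscr([PostLdaV PostKnnV], GrpVal);  % combiner trained on the concatenated posteriors
[~,PostLdaT] = predict(MdlLda, TST);               % same graded measures for the test set
[~,PostKnnT] = predict(MdlKnn, TST);
LbPred = predict(MdlComb, [PostLdaT PostKnnT]);    % final decision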

10.7 Error-Correcting Output Codes

When we look at the predicted labels for the training set for the different base learners, we are essentially faced with a binary table. In that table there may appear systematic errors, which in turn can be corrected by learning appropriate modifications.

This technique is in particular used when solving a multi-class problem with K binary classifiers, such as when an SVM is applied to a multi-class task. In that case we build K one-vs-all base learners, but one can also build a set of one-vs-one classifiers, K(K−1)/2 binary classifiers in total. Of course, with such schemes the amount of computation increases correspondingly. When constructing such an ensemble classifier, one should pay attention to the class imbalance problem (Section 6.4.1); the software functions will typically take care of any imbalance.

In Matlab these error-correcting methods are implemented in the function fitcecoc.
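A hedged usage sketch; templateSVM defines the binary base learner, and the evaluation follows the folding scheme used throughout this workbook:

t    = templateSVM('Standardize',true);               % binary SVM as base learner
MdCv = fitcecoc(DAT, GrpLb, 'Learners',t, 'kfold',5);
pcEcoc = 1 - kfoldLoss(MdCv);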

10.8 Recapitulation

Ensemble classifiers are particularly useful if your data are heterogeneous, for example if the data come from different sources, contain different data types, or have classes that are themselves heterogeneous, essentially consisting of sub-classes. In such cases it certainly makes sense to try an ensemble classifier.

An ensemble classifier often achieves its best accuracy when its base learners are sub-optimally trained. Thus, combining the output of Support-Vector Machines may perform worse than combining the output of Linear Discriminants, Naive Bayes classifiers, etc.

The downside of ensemble classifiers is that it can take some time to find the right combination of base learners. The upside is that one can achieve good prediction accuracies with a relatively simple method, in comparison to complex methods such as Deep Neural Networks.


11 Recognition of Sequences DHS p413, s 8.5, pdf 481

ThKo p487, s 8.2.2
Now we look at the classification of strings or sequences, which again cannot be compared with typical metric methods. It is another case of classification with nominal data - the first one we introduced with Decision Trees in Section 9. The following methods address patterns that are described by a variable-length string of nominal attributes, such as a sequence of base pairs in a segment of DNA, e.g., 'AGCTTCAGATTCCA', or the letters in a word or text. The methods are useful for dealing with sequences in general.

A particularly long string is denoted text. Any contiguous string that is part of x is called a substring, segment, or more frequently a factor of x. For example, 'GCT' is a factor of 'AGCTTC'. There is a large number of problems in computations on strings. The ones that are of greatest importance in pattern recognition are:
- String matching: Given x and text, test whether x is a factor of text, and if so, determine its position.
- Edit distance: Given two strings x and y, compute the minimum number of basic operations - character insertions, deletions and exchanges - needed to transform x into y.
- String matching with errors: Given x and text, find the locations in text where the 'cost' or 'distance' of x to any factor of text is minimal.
- String matching with the 'don't care' symbol: This is the same as basic string matching, but with a special symbol, ∅, the don't care symbol, which can match any other symbol.

We introduce only the first two.

11.1 String Matching Distance

The simplest detection method is to test each possible shift, which is also called 'naive string matching'. A more sophisticated method, the Boyer-Moore algorithm, uses the matched result at one position to predict better possible matches, thus not testing every position and accelerating the search.

Figure 18: The general string-matching problem is to find all shifts s for which the pattern x appears in text. Any such shift is called valid. In this case x = "bdac" is indeed a factor of text, and s = 5 is the only valid shift. [Source: Duda, Hart, Stork 2001, Fig 8.7]

Usage in Matlab The function strfind carries out this simple type of matching.
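For example, searching the DNA string from the introduction of this section for the pattern 'TTC':

idx = strfind('AGCTTCAGATTCCA', 'TTC')   % returns [4 10], the valid shifts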

11.2 Edit Distance

The edit distance between x and y describes how many fundamental operations are required to transform x into y. The fundamental operations are:

- Substitutions: a character in x is replaced by the corresponding character in y.

- Insertions: a character in y is inserted into x, thereby increasing the length of x by one character.

- Deletions: a character in x is deleted, thereby decreasing the length of x by one character.

Let C be an m × n matrix of integers associated with a cost or distance and let δ(·, ·) denote a generalization of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise. The basic edit-distance algorithm (Algorithm 7) starts by setting C[0, 0] = 0 and initializing the left column and top row of C with the integer number of steps away from i = 0, j = 0. The core of this algorithm finds the minimum cost in each entry of C, column by column (Figure 19). Algorithm 7 is thus greedy in that each column of the distance or cost matrix is filled using merely the costs in the previous column.


Algorithm 7 Edit distance. From DHS p486.
Initialization: x, y, m ← length[x], n ← length[y]
Initialization: C[0, 0] = 0
Initialization: For i = 1..m, C[i, 0] = i, End
Initialization: For j = 1..n, C[0, j] = j, End
For i = 1..m
  For j = 1..n
    Ins = C[i−1, j] + 1                    % insertion cost
    Del = C[i, j−1] + 1                    % deletion cost
    Exc = C[i−1, j−1] + 1 − δ(x[i], y[j])  % (ex)change cost: 0 if the characters match
    C[i, j] = min(Ins, Del, Exc)           % the minimum of the 3 costs
  End
End
Return C[m, n]
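A small, self-contained Matlab implementation of Algorithm 7 (the function name editDistance is our own choice); editDistance('excused','exhausted') should return 3, in agreement with Figure 19:

function d = editDistance(x, y)
% editDistance  Minimum number of substitutions, insertions and deletions
% needed to transform string x into string y (Algorithm 7).
m = length(x);  n = length(y);
C = zeros(m+1, n+1);             % cost matrix; indices shifted by +1 relative to the text
C(:,1) = 0:m;  C(1,:) = 0:n;     % initialize the left column and the top row
for i = 1:m
    for j = 1:n
        Ins = C(i, j+1) + 1;                 % insertion cost
        Del = C(i+1, j) + 1;                 % deletion cost
        Exc = C(i, j) + (x(i) ~= y(j));      % exchange cost: 0 if the characters match
        C(i+1, j+1) = min([Ins, Del, Exc]);
    end
end
d = C(m+1, n+1);
end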

As shown in Figure 19, x = "excused" can be transformed to y = "exhausted" through one substitution and two insertions. The table shows the steps of this transformation, along with the computed entries of the cost matrix C. For the case shown, where each fundamental operation has a cost of 1, the edit distance is given by the value of the cost matrix at the sink, i.e., C[7, 9] = 3.

Figure 19: The edit distance calculation for strings x and y can be illustrated in a table. Algorithm 7 begins at the source, i = 0, j = 0, and fills in the cost matrix C, column by column (shown in red), until the full edit distance is placed at the sink, C[i = m, j = n]. The edit distance between excused and exhausted is thus 3. [Source: Duda, Hart, Stork 2001, Fig 8.9]

The algorithm has complexity O(mn) and is rather crude; optimized algorithms achieve a lower complexity. Linear programming techniques can also be used to find a global minimum, though this nearly always requires greater computational effort.

Note: as mentioned in the introduction, the pattern can consist of any (limited) set of ordered elements, and not just letters. Example: the edit distance is sometimes applied in computer vision, specifically shape recognition, for which a shape is expressed as a sequence of classified segments.

Usage in Matlab: The Matlab toolbox 'Bioinformatics' provides a set of alignment functions, e.g. localalign, nwalign, etc.


12 Density Estimation

Density estimation is the characterization of a low-dimensional data distribution, typically a one-dimensional distribution, sometimes two-dimensional, rarely three-dimensional. Density estimation is similar in principle to clustering (Section 7) and sometimes it is even considered part of the topic of clustering. While in clustering one attempts to identify densities in higher-dimensional data, in density estimation the data are typically only one-dimensional, for instance an individual variable (feature) of a multi-dimensional dataset - a column of your data matrix. The data under investigation can also be a two-dimensional distribution, for instance the spatial locations of objects in a space. It can also be a three-dimensional distribution; however, with increasing dimensionality density estimation quickly becomes computationally intensive, as we will learn later. In density estimation, one often seeks an adequate visual display to understand the distribution better, which in clustering is difficult to create; and one typically identifies the modes (maxima) of the distribution.

There are two principal types of density estimation methods: non-parametric and parametric methods, Sections 12.1 and 12.2, respectively. In non-parametric methods, the distribution is merely transformed and we typically specify a single parameter for this transformation. In parametric methods we are more explicit: we specify the number of expected densities, for instance - similar to specifying k for the K-Means algorithm.

12.1 Non-Parametric Methods Alp p165

The distribution is observed through different 'windows' or 'local neighborhoods', which are placed across the range of the data. There are two methods of 'windowing', see also Fig. 20. For the method of histogramming, the windows are called bins, and all we do is count the number of data points that lie within a bin. This count is then illustrated in a simple bar plot (Section 12.1.1). In the method of kernel estimation, the window is called kernel or 'Parzen window' and we take some weighted average of the points within; kernel estimation results in a smooth distribution function, as opposed to the histogramming method (Section 12.1.2).

Figure 20: Density estimation. The data distribution is shown as black dots at y = -0.5; the data could be the values of one variable of the data matrix D, for instance. The (blue) bars represent an estimation by histogramming using a bin width equal to 1. The (green) dotted curve is an estimation using a kernel function that calculates differences between points and therefore works similarly to a similarity measure.


12.1.1 Histogramming wiki Histogram

Histogramming is the simplest kind of density estimation - and the fastest one. When we generate a histogram, the data are assigned to bins whose borders are called edges; the spacing between two edges is called the bin width. In Fig. 20 a regular spacing was chosen with bin width equal to one (blue bars), but unequal spacing is possible too, in which case one specifies an array of edges. The estimate is zero if no instance falls within a bin.

Choosing the appropriate bin size can be tricky, in particular if one intends to find a description of the distribution. A too small bin width would not generalize sufficiently, and a too large bin width would generalize too much. Setting the appropriate range boundaries is not straightforward either. The exercises will clarify that.

Histogramming can be done in multiple dimensions too. A two-dimensional histogram is also called a bivariate histogram; for three or more dimensions one can talk of n-dimensional histograms. Two-dimensional histograms can also be displayed as bar histograms, namely as columns standing in a plane. For three or more dimensions, it becomes difficult to display the data and one needs to observe multi-dimensional data as two-dimensional subspaces, for instance, to obtain an idea about the data set.

Histogramming appears so convenient and effective that one is tempted to generate this density estimate for the entire dataset, that is, for all variables of the data matrix. For a few dimensions this is possible, but the larger the dimensionality, the quicker we reach memory limits. For instance, with 10 bins per dimension and the single data type (4 bytes per count), 7 dimensions already require 40 Megabytes and 9 dimensions require 4 Gigabytes of memory. Therefore, for high dimensionality, histogramming is unfeasible.

Usage in Matlab With the command histogram you can plot a histogram, whereby it is possible to specify the number of bins or an array of edges. With the command histcounts you receive the actual histogram counts. Let's assume your data DAT have been scaled already and you intend to observe the first variable and prefer to specify bin edges:

Edg = linspace(0,1,21); % bin edges from 0 to 1 in steps of 0.05

H = histcounts(DAT(:,1), Edg);

bar(Edg(1:end-1), H, ’histc’);

For a two-dimensional histogram you can use the function hist3. If you need n-dimensional histogramming, then try the function in Appendix F.17.2.

12.1.2 Kernel Estimator (Parzen Windows) wiki Kernel density estimation, Alp p167

ThKo p51
In the method of kernel estimation, the computation involves not only counting data points but also distance measurements between points within the window. The window size and the distance measurement are specified by a function called the kernel. Those kernels (windows) can be placed equally spaced throughout the data range at a set of specified points x, or we can place them exactly at the individual data points. The example in Fig. 20 uses equally spaced points with a spacing equal to one to compare it with the histogram. At each such specified point x, the distribution x^t (t = 1, .., N, N = number of data points) is observed by the kernel, whose behavior can be compared to a lens, metaphorically speaking - very much like in a similarity measure (Appendix A.2): points near the center are given more 'attention' than those in the periphery. In other words, at each specified point x the points of the distribution x^t are weighted by the kernel function K (wiki Kernel smoother). The kernel function typically takes firstly the difference between two points, one point a selected x from the equally spaced set of points, the other point from the distribution x^t. That difference is then normalized by a parameter h called the bandwidth, which controls the width of the lens (the window size):

K\left( \frac{|x - x^t|}{h} \right) \qquad (12)

There are many different kernel functions, see for example wiki Kernel (statistics)#Kernel functions in common use. The most popular kernel function for density estimation is the Gaussian function g(x; µ, σ), see Equation 21


(in Appendix B). The location parameter µ corresponds to the variable x and the width σ corresponds to the bandwidth h.

Returning to our formulation for the density estimate: it is expressed as the function f(x), which consists of the sum of weighted values for each x^t obtained by the kernel function K with center x:

f(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\left( \frac{|x - x^t|}{h} \right), \qquad (13)

whereby the divisor Nh normalizes the function. The code example in Appendix F.17.1 should clarify the method.
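A minimal sketch of a Gaussian kernel estimate in Matlab, assuming the one-dimensional data are in the column vector x and evaluating the estimate on a regular grid xe:

h  = 0.25;                              % bandwidth (illustrative value)
xe = linspace(min(x), max(x), 200)';    % points of evaluation
N  = numel(x);
f  = zeros(size(xe));
for t = 1:N                             % add one Gaussian kernel per data point
    f = f + exp(-0.5*((xe - x(t))/h).^2) / sqrt(2*pi);
end
f = f / (N*h);                          % normalization as in Equation 13
plot(xe, f);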

Kernel estimation can also be done in two or more dimensions. But the larger the dimensionality, the more computation is required. Because we need to calculate the distance between each set point x and all data points x^t, the complexity quickly becomes unfeasible. It is common to perform kernel estimation for spatial coordinates, for instance - that is, two dimensions - with the purpose of determining an object's location exactly. But for three or more dimensions it is used rarely, and one would rather use a clustering algorithm instead.

Usage in Matlab Matlab offers the function ksdensity, which by default uses the Gaussian function as the kernel, so you can specify the bandwidth, i.e.

D = ksdensity(DAT(:,1), ’width’, 0.25);

but you can also omit it and the bandwidth will be estimated by a simple rule.

12.2 Parametric Methods Alp p61

Parametric means we express the distribution by parameters, that is, by an equation with more than one parameter, which is also called the probability density function (PDF) in the context of density estimation. If we assume that our distribution has only one mode - also known as a uni-modal distribution - then the simplest parametric description would be to take its mean µ and standard deviation σ, also known as the first-order statistics of the distribution. If we then intend to determine the distance of new samples to this uni-modal distribution, then it would be convenient to employ the Gaussian function again (Equation 21). This is exactly what is done for the Naive Bayes classifier (Section 15).

If however we suspect two or more modes in our distribution, then we need an algorithm that finds µ and σ for each mode automatically. For example, in Fig. 20 we have observed with the kernel method that there may exist two major densities in the distribution. Of course, one could take the K-Means algorithm (Section 7), which in fact is very similar to the following approach, but the method presented here is subtler and somewhat more precise.

12.2.1 Gaussian Mixture Models (GMM)

Here we expect that the distribution consists of multiple modes, meaning we know that there are two or more sources giving rise to bi-modal or multi-modal distributions, respectively. The goal is then to locate the precise position of each source and to estimate the standard deviation it introduces into the data, which is again a case for the Gaussian function. We specify the number k of Gaussians we expect and then try to find the corresponding centers and standard deviations, µi and σi respectively (i = 1, .., k). One therefore calls this a Gaussian Mixture Model (GMM): the model simply adds the output of k Gaussian functions, whose means and standard deviations correspond to the locations of the modes and to the widths of the assumed underlying distributions, respectively.

The most popular algorithm to find the appropriate values for µi and σi is the so-called Expectation-Maximization (EM) algorithm. The algorithm gradually approaches the optimal values by a search procedure that is very akin to the K-Means algorithm (Algorithm 4), hence the relation of density estimation to clustering.


Usage in Matlab With the function gmdistribution.fit we can find a GMM (available in the Statistics toolbox), for which we specify k as the minimum parameter:

Dgm = gmdistribution.fit(DAT,3);

Gm = pdf(Dgm, PtEv);

It returns a structure, called Dgm in our case, which we then pass to the function pdf that evaluates the GMM at the points PtEv (points of evaluation). We give a full example in Appendix F.17.3, without any further explanation.

12.3 Recapitulation

Density estimation is used for analyzing variables or specific distributions of low dimensionality, typically one to three dimensions at most. Histograms are suitable for obtaining an idea of what type of distributions we deal with, for instance when observing the variables (dimensions) of a data matrix. Kernel estimation is suitable if we have particular data sets, spatial coordinates for instance.

For larger dimensionality, however, density estimation becomes unfeasible. In the case of histogramming we quickly reach memory limitations; in the case of the kernel-density estimation and the parametric methods, the computational complexity is the limiting factor. For large dimensionality one rather resorts to clustering methods.


13 Support Vector Machines ThKo p119

A Support Vector Machine (SVM) is sometimes considered an elaboration of the linear classifier introduced previously. A SVM focuses on samples that are difficult to classify, somewhat akin to the hard negative mining technique mentioned before (Section 6.4.3); a SVM will 'dent' the decision boundary based on those difficult examples. For that reason, a SVM typically performs better than a 'regular' linear classifier, but it also requires more tuning: its learning duration is typically much longer, and it may only work if the classes are reasonably well separable. The following characteristics make a SVM distinct from an ordinary linear classifier:

1. Kernel function: The SVM uses such functions to project the data into a higher-dimensional space in which the data are hopefully better separable than in their original lower-dimensional space. Kernel functions are typically similarity measures, see also Appendix A.

2. Support Vectors: The SVM uses only a few sample vectors for generating the decision boundaries and those are called support vectors. For a 'regular' linear classifier, there exist multiple reasonable decision boundaries that separate the classes of the training set. For instance, the optimal hyperplane in Figure 21 could actually show slightly different orientations. The SVM finds the hyperplane that also gives a good generalization performance, whereby the support vectors are exploited for what is called 'maximizing the margin' - the two bidirectional arrows delineate the margin.

An SVM is however only a binary classifier, taking only two classes as input. If we intend to solve a multi-class task with it, we need to construct K one-versus-all classifiers and then combine their outputs, a technique introduced in Section 10.7 already.

13.1 Usage in Matlab

The overview in Appendix F.1 already gave two examples, one showing how to use the all-in-one function. In older Matlab versions you may find the SVM under the Bioinformatics toolbox: the function svmtrain trains the model and the function svmclassify applies the trained model to the test set.

Scaling By default the function fitcsvm will not scale the data, see again Section 2.3. If you intend to test your data with scaling, then you need to set the parameter 'Standardize' to true. In previous Matlab versions it was the other way around: scaling was the default.

Kernel Function By default Matlab's SVM uses a linear kernel (see also Appendix A). Try a different one, if you aim to optimize your prediction accuracy. In previous Matlab versions, the default used to be the dot product.

Lack of Convergence It is not rare that the SVM learning algorithm does not converge to a solution with its default settings. The error may look something like No convergence achieved .... In that case we need to play a bit with certain parameters. Here are two tricks:

1. Lower the box constraint parameter. The default equals one. Try something smaller, maybe 0.95.

2. Increase the parameter KKTTolerance. The default equals 0.001. Set it to 0.005 to hopefully improve convergence.

Specify those parameters as follows:

Svm = fitcsvm(TRN, GrpTrn, 'BoxConstraint',0.95, 'KKTTolerance',0.005);

13.2 Recapitulation

Application If a binary classification task needs to be optimized, it is definitely worth trying a SVM: chances are good your prediction accuracy will increase, but occasionally it will not. It can also be worth trying to classify three or more classes with the one-versus-all scheme as mentioned in Section 10.7.


Figure 21: Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane; in this illustration there are three support vectors, two black ones and one red one. [Source: Duda, Hart, Stork 2001, Fig 5.19]

Advantages
- SVMs are probably the best binary classifiers.
- The SVM is robust in the sense that it does not require a feature transformation, as opposed to the linear discriminant classifier, which often requires that the data are made more compact using, for instance, the principal component analysis.

Disadvantages
- SVMs sometimes require parameter tuning, as opposed to linear classifiers, but probably less so than neural networks.
- The learning duration is somewhat long: it typically takes longer than a standard linear classifier (as in the Matlab function fitcdiscr), but it is still much quicker than a Deep Neural Network.
- A SVM may not work well if classes are not reasonably separable. When SVMs are used in multi-class classification tasks, they lose somewhat their 'binary' advantage.


14 Deep Neural Network (DNN) wiki Deep learning

A Deep Neural Network (DNN) is an elaboration of a traditional (artificial) neural network - like the Support Vector Machine is an elaboration of the traditional Linear Classifier. In order to understand the idea of a DNN, we first sketch traditional neural networks in Section 14.1, which is also helpful to understand why they are still being used in other classifier methodologies, for instance in ensemble classifiers (Section 10) or Support Vector Machines (Section 13).

The neural network methodology is more diverse than any other classifier methodology; in particular, there exist more learning algorithms than for any other classifier methodology. A network needs to be designed, and a key problem is to find the appropriate network architecture, also called topology sometimes. This search for the appropriate topology is usually done purely heuristically and is therefore somewhat time-consuming.

The performance of traditional networks was not better than the performance of other classifiers for a general classification task. For those reasons traditional neural networks were not always taken seriously by traditional machine learning scientists. The latest developments in neural network methodology have however produced networks that frequently classify datasets more accurately than any other classifiers - sometimes by a large gain. But the downside of finding the appropriate topology persists and has become even more challenging in some cases. In Section 14.2 we introduce so-called Convolutional Neural Networks (CNNs), which are now widely used for image classification. In Section 14.3, we introduce a type of Deep Belief Network (DBN), which is a very potent, general classifier.

14.1 Traditional Neural Networks

There are two principal, traditional neural networks: the Perceptron, which can be considered the precursor of all neural networks; and the Multi-Layer Perceptron, which consists of - as the name implies - two or more layers of Perceptrons.

Figure 22: Network topologies: circles represent neural units; straight lines represent connections between units, with each connection 'flow' being controlled by a weight value. The input layer is usually placed at the bottom, the output layer is placed at the top; hence flow proceeds from bottom to top for classification, for learning it proceeds top to bottom.
Left: the architecture of a (multi-class) linear classifier or Perceptron. The classification flow is considered feed-forward. In this specific case there are 4 input units and 3 output units; the unit count of the output layer often corresponds to the number of classes to be trained. This diagram is also used for depicting the architecture of a so-called Restricted Boltzmann Machine (RBM) in the Deep NN methodology; for the RBM the 'flow' is more flexible than for a Perceptron but is difficult to depict.
Right: the architecture for a three-layer network, e.g. a multi-layer Perceptron (MLP). For MLPs, the classification process occurs feed-forward, the training process occurs backward. This diagram is also used for a Deep Belief Network (DBN) if the layers are trained as RBMs; in that case only the classification process is considered feed-forward; the learning process is more complex.


Perceptron wiki Perceptron
The Perceptron is essentially a linear classifier as introduced in Sections 4 and 15, but uses a different learning method, namely the Perceptron learning rule. That learning rule is not very robust and is considered obsolete by now, but its lack of robustness is in fact an advantage in ensemble classifiers (Section 10). A Perceptron - or any Linear Classifier - can be regarded as a two-layer network consisting of an input layer and an output layer (Fig. 22 left). Those two unit layers hold a layer of weights, which corresponds to the weight matrix as in Equation 4. The Perceptron architecture in Fig. 22 is specifically a multi-class Perceptron; the equivalent linear classifier would be a so-called 'multi-class linear classifier'.

Multi-Layer Perceptron (MLP) wiki Multilayer perceptron
Multi-Layer Perceptrons are stacks of Perceptrons. Typically, the term MLP refers to three layers, namely input layer, hidden layer and output layer (Fig. 22 right). But an MLP can have four or more layers in principle, that is, two or more hidden layers. The hidden layer is typically all-to-all connected (as in the Figure): each unit receives the values from all input units, and it transmits the computed value to all the units in the next layer - either the output layer or another hidden layer. A hidden layer is typically understood as the feature layer: it 'recognizes' part of its input, which in later layers is then integrated to complete the class information.

An MLP is a so-called feed-forward network because for the classification of a testing sample, the information flow propagates only forward, namely from the input layer to the hidden layer, to the next hidden layer - if present - until the output layer. (This is in contrast to so-called recurrent networks, where information flow occurs in loops.)

MLPs can be used as kernel functions in Support Vector Machines (Section 13) - the Matlab function svmtrain even contains an option to use them as such. They can also be used in ensemble classifiers, very much like the Perceptrons.

There are different learning rules to train an MLP - the by far most successful one is the so-called back-propagation algorithm. As the name suggests, it works backward through the layers to adjust the weights in the connection layers.

The back-propagation algorithm is often the final step in learning a DNN architecture.

14.1.1 Usage in Matlab

The Neural Network toolbox in Matlab provides a set of commands to simulate the traditional NNs, such as the Perceptron or the MLP. Those commands typically start with the letters net (for network). With the command network you initialize a network. We leave it at that without further explanation, because instead of tuning a traditional network it might be worth trying to directly tune a DNN.

14.2 Convolutional Neural Network (CNN) wiki Convolutional neural network

A Convolutional Neural Network gradually builds an abstraction of its input by first detecting local features, followed by a gradual assembly of those local features toward more global features. The term convolution expresses a mathematical operation in which a varying signal - in our case the input - is systematically analyzed with another, fixed signal - in our case a set of weights favoring a specific part of the input. In a CNN, that fixed signal is some local feature and needs to be found during the learning process. CNNs are particularly popular in the domain of image classification, where a network can consist of up to ten hidden layers; that would be a twelve-layer network! CNNs can be regarded as elaborations of Multi-Layer Perceptrons (Section 14.1).

A CNN has two types of hidden layers: feature layers and pooling layers (Fig. 23). A feature layer observes the result of a convolution - the corresponding weight layer is also called a convolutional layer sometimes. A pooling layer receives a feature layer as input and merely sub-samples it. In the sequence of layers from input to output, the feature layers and the pooling layers alternate in order to arrive at a local-to-global integration. Thus, in determining the topology of a CNN, one needs to experiment with the neighborhood size for the convolutions, as well as with the sub-sampling step for the pooling layers.


Figure 23: A simple Convolutional Neural Network (CNN) for learning to classify images. This architecture has four layers: input, feature, pooling and output layer; classification flow occurs from left to right. The feature layer is sometimes also called a feature map: it is the result of a convolution-type scanning of the input: each unit observes a local neighborhood in the input image. The pooling layer merely carries out a sub-sampling of the feature layer. In a typical CNN, there are several alternations between feature and pooling layers. The learning process tries to find optimal convolutions that help to separate the image classes.

Unlike the hidden unit of an MLP, the hidden unit of a CNN does not receive input from all its predecessors anymore, but only from a certain neighborhood. For instance, if the input is a 30×30 pixel image, then the first hidden unit would observe only the 6×6 pixel neighborhood in the upper left corner of the image; the second hidden unit observes the neighborhood shifted by one pixel to the right of the first one, etc. The hidden units of a CNN cover the entire input - corresponding to a convolution. Each hidden unit would observe whether its neighborhood contains a specific feature that is common across instances, for example a straight bar, a dot, or any other geometrical structure.

The advantage of CNNs is that their classification accuracy is better than that of other approaches, and sometimes much better, which makes them the first choice when signals such as images need to be classified. But their downside is that the training duration is very long. And because it requires some time to find the appropriate topology, it can take weeks to tune such a network.

To speed up the learning process there exist two tricks. One is to use an NVIDIA graphics card with thousands of so-called CUDA cores. The other is to use a pretrained network: those consist of feature maps that have already been trained on other datasets.

14.2.1 Usage in Matlab

Providing a code example would take a lot of space. We therefore merely point out that example code can be found on the 'File Exchange' site of Mathworks' website, as well as on the wiki page for CNNs (https://en.wikipedia.org/wiki/Convolutional_neural_network).

14.3 Deep Belief Network (DBN) wiki Deep belief network

A Belief Network is a network that operates with so-called conditional dependencies. A conditional dependence expresses the relation of variables more explicitly than just by combining them with a weighted sum. However, determining the full set of parameters for such a network is exceptionally difficult. Deep Belief Networks (DBNs) are specialized in approximating such networks of conditional dependencies in an efficient manner, that is, at least partially and in reasonable time. Popular implementations of such DBNs consist of layers of so-called Restricted Boltzmann Machines (RBMs).


The principal architecture of an RBM is the same as for a Linear Classifier (or Perceptron; left in Fig. 22), but the architecture of an RBM contains an additional set of bias weights (not shown in the figure). Those additional weights make learning more difficult but also more capable - they help solve those conditional dependencies. The typical learning rule for an RBM is the so-called contrastive divergence algorithm.

The choice of an appropriate topology for the entire DBN is relatively simple. With two hidden layers made of RBMs, one can already obtain fantastic results; a third layer rarely helps in improving classification accuracy. Learning in a DBN occurs in two phases. In a first phase, the RBM layers are trained individually, one at a time, in an unsupervised manner: the RBMs essentially perform a clustering process. Then, in the second phase, the entire network is fine-tuned with the back-propagation algorithm.

As with Convolutional NNs, a Deep Belief Network takes much time to train. However, the choice of the architecture is easier to determine, as one obtains good results with two layers already. The main advantage of this type of Deep Network is that it is fairly robust: it produces results even for difficult datasets, for which SVMs are difficult to apply; and it often provides a classification accuracy that is similar to or even better than that of SVMs. Thus, some scientists prefer DBNs over SVMs.

14.3.1 Usage in Matlab

Again, providing a code example would take too much space. An example can be found on Mathworks' website.

14.4 Recapitulation

A Deep Neural Network (DNN) is definitely worth trying out, as it likely provides a better classification accuracy than most other classifiers in many tasks. In the domain of image classification, Convolutional Neural Networks (CNNs) have proven to provide the best accuracy, but finding the appropriate architecture remains a heuristic endeavor. A Deep Belief Network (DBN) made of Restricted Boltzmann layers is a classifier as powerful as the SVM, perhaps even more powerful. It however takes a long time to train a Deep Network, and if one needs qualitative results only, it is probably more convenient to use a traditional classifier.

Advantages
- DNNs can tackle 'large-scale' problems, which before their implementation could not really be approached. In that sense they have ushered in a new era.
- Once they have been tuned properly, DNNs are fairly robust. DNNs do not require any dimensionality reduction.

Disadvantages
- The search for the appropriate architecture is a heuristic issue, in particular for CNNs.
- DNNs sometimes require substantial parameter tuning; yet less so than SVMs according to my experience.
- The learning duration can be terribly long. However, in the case of CNNs, learning can be substantially accelerated with the acquisition of proper hardware (an NVIDIA graphics card with CUDA cores).


15 Naive Bayes Classifier (Linear Classifier II) wiki Naive Bayes classifier

We now study a specific type of linear classifier in more detail, namely the Naive Bayes classifier. It is a fairly simple procedure and has theoretical elegance, but in practice it is not quite as competitive as other classifiers. It maintains its place amongst the competition when it comes to specific tasks or when the sample size is small.

The Naive Bayes algorithm assumes that the features (variables) are independent, meaning any information sharing between features is not considered, which earns it the name 'naive'. Yet in practice, many datasets contain features that are correlated to some extent; one can simply observe that with the covariance matrix.

One version of the Naive Bayes classifier performs density estimation assuming uni-modal Gaussian distributions as introduced in Section 12.2, namely by taking the mean and the standard deviation of the individual feature dimensions for each class (group). This is also a 'naive' assumption, because as we have seen previously, the distribution of feature values is often far from being a Gaussian function.

Figurative Example. In our country-guessing example, we would approximate the distribution of cars for each country by a separate density function and then determine our location by using the density functions only. For a given (spatial) location we compute the values for the different countries (from their individual functions), and the one that returns the highest value determines our choice of country.

Learning and Classifying: Taking Figure 2 as an illustration, we would fit a 2D Gaussian to each dataset- the two distributions were in fact generated with Gaussian functions (see again Appendix B). We thenclassify a new sample based on the parameters of these two Gaussian models: we compute the Gaussianfunction value for both classes and the larger function value then determines the preferred class label. Thuswe train and apply our classifier as follows:

Algorithm 8 Naive Bayes Classifier.  k = 1, .., c (nclasses, K)
Training: ∀ c classes (∈ DL): mean µ_k, covariance Σ_k, determinant |Σ_k|, inverse Σ_k^{-1}, prior P(k) → g_k as in Equation 23
Testing: 1) for a testing sample x ∈ DT determine g(x) ∀ c classes → g_k.
         2) multiply each g_k with the class prior P(k): f_k = g_k · P(k)
Decision: choose the maximum of f_k: argmax_k f_k

If the classes occur with uneven frequencies, we need to determine the frequency of each class, also called the prior, and include it as pointed out in the training step and in step 2 of the testing phase.

The Naive Bayes classifier suffers from the same problems as mentioned before (Section 4). It can be difficult to compute the covariance matrix, in particular for few training samples (the small sample size problem).

15.1 Usage in Matlab, Implementation

The usage of the all-in-one function was demonstrated in Appendix F.1 already. But implementing a Naive Bayes classifier is not so difficult, see Appendix F.18 for an example (see also ThKo p81). With det we calculate the determinant. The computation of the inverse is preferably done with the command inv, but if the inverse is difficult to compute, for instance due to small sample size, then one can estimate the inverse with pinv. If the inverse still cannot be computed, then we need to perform a dimensionality reduction (Section 5).

We did not include the prior in this code fragment, which one can generate with

Prior = Hgrp./sum(Hgrp(:))


for instance, where Hgrp is the sample count for each class (histogram of group variable, see Appendix F.3).

In previous Matlab versions, the Naive Bayes classifier was implemented by the command classify and the option 'diaglinear' (or 'diagquadratic').
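As a further reference, here is a minimal sketch of Algorithm 8 in its multivariate Gaussian form; the variable names TRN, GrpTrn and TST are illustrative and not those of Appendix F.18:

c    = max(GrpTrn);                              % class labels assumed to be 1..c
Post = zeros(size(TST,1), c);
for k = 1:c
    Xk  = TRN(GrpTrn==k, :);
    Mu  = mean(Xk, 1);
    Sg  = cov(Xk);
    Pk  = size(Xk,1) / size(TRN,1);              % class prior
    Dif = bsxfun(@minus, TST, Mu);
    Mah = sum((Dif / Sg) .* Dif, 2);             % squared Mahalanobis distance
    Gk  = exp(-0.5*Mah) / sqrt((2*pi)^size(TRN,2) * det(Sg));   % Equation 23
    Post(:,k) = Gk * Pk;                         % g_k times the prior
end
[~, LbPred] = max(Post, [], 2);                  % decision: argmax_k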

15.2 Recapitulation

The Naive Bayes classifier was introduced for instructional purposes only. The advantages and disadvantages are essentially the same as mentioned in Section 4. The Naive Bayes has maintained its place in specific applications; and it is of theoretical value in pattern recognition (see also Section 16).


16 Classification: Rounding the Picture & Check List DHS p84

16.1 Bayesian Formulation

A typical textbook on pattern classification (with mathematical ambition) starts by introducing the Bayesian formalism and its application to the decision and classification problem. We introduce this formalism at the end, because now - after we have worked with the different classifiers - it can be understood more easily. Bayes' formalism expresses a decision problem in a probabilistic framework:

Bayes rule:   P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)        (14)

in words:     posterior = (likelihood × prior) / evidence

We first explain the terms in 'natural' language, as given on the right side of the above equation: DHS p22,23, Alp p50

Posterior: is the probability for the presence of a specific category ω_j given the sample x.

Likelihood: is the value computed using the density function. In the example of the Naive Bayes classifier (Section 15), it is the value of Equation 23.

Prior: is the probability for the category being present in general, that is, the frequency of its occurrence. We called this the prior already in Algorithm 8 (Section 15.1).

Evidence: is the marginal probability that an observation x is seen - regardless of whether it is a positive or negative example - and ensures normalization. (This was not explicitly calculated.)

Expressed formally we now say: given a sample x, the probability P(ω_j | x) that it belongs to class ω_j is the class-conditional probability density function p(x | ω_j), multiplied by the probability with which the class appears, P(ω_j), divided by the evidence p(x). We can formalize the evidence as follows:

p(x) = Σ_{j=1}^{c} p(x | ω_j) P(ω_j) = Σ (likelihood × prior) = normalizer to ensure Σ_j P(ω_j | x) = 1        (15)
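For illustration, with numbers assumed for this example only: suppose two classes with priors P(ω_1) = 0.7 and P(ω_2) = 0.3, and likelihoods p(x | ω_1) = 0.2 and p(x | ω_2) = 0.6 for a given sample x. The evidence is then p(x) = 0.2 · 0.7 + 0.6 · 0.3 = 0.32, and the posteriors are P(ω_1 | x) = 0.14/0.32 ≈ 0.44 and P(ω_2 | x) = 0.18/0.32 ≈ 0.56, which sum to 1; the sample would be assigned to ω_2.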

16.1.1 Rephrasing Classifier Methods

Given the above Bayesian formulation, we can now rephrase the working principle of the three classifier types (Sections 3, 4, 15) as follows:

k-Nearest-Neighbor (Section 3): estimates the posterior values P(ω_j | x) directly, without attempting to compute any density functions (likelihoods); in short, it is a non-parametric method, because no effort is made to find functions that approximate the density p(x | ω_j). kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.

Naive Bayes Classifier (Section 15): is essentially the simplest version of the Bayesian formulation, and that classifier makes the following two assumptions in particular:

1. It assumes that the features are independent and identically distributed (i.i.d.), in short statistically independent. This is also called the Naive Bayes rule. But often we do not know beforehand whether the dimensions are uncorrelated.

2. It assumes that the features are Gaussian distributed (µ ≡ E[x], Σ ≡ E[(x − µ)(x − µ)^t]).

For most data, these are two strong assumptions, because most data distributions are more complex. Despite those two strong assumptions, the Naive Bayes classifier often returns acceptable performance.

Discriminative Model (Section 13): these are similar to the kNN approach in the sense that they do not require knowledge of the form of the underlying probability distributions. Some researchers argue that attempting to find the density function is a more complex problem than trying to directly develop discriminant functions.


16.2 Estimating Classifier Complexity - Big O Notation wiki Big O notation

We already discussed some of the advantages and disadvantages of the different classifier types in terms of their complexity. This is typically expressed with the so-called Big O notation. In short, the notation classifies algorithms by how they respond to changes in input size, e.g. how a change affects the processing time or the working-space requirements. In our case, we investigate changes in n or d (of our n×d data matrix). The issue is too complex to elaborate here; we merely summarize what we have mentioned so far and what will be mentioned in later sections. For classifiers, we also make the distinction between the complexity during learning and the complexity of classifying a testing sample:

Classifier                 Learning            Classification
k Nearest Neighbor         -                   O(dn) [slow]
Linear                     O(d^2)              O(d)
Decision Tree              O(d)                O(d)
Support Vector Machine     O(n^2) [slow]       O(d)
Deep Neural Network        O(d1·d2) [slowest]  O(d)

Clustering Algorithm
K-Means                    O(ndkT)
Hierarchical               O(n^2) [slow]

Table 3: Complexities of classification and clustering methods.
d = number of dimensions. n = number of samples. k = number of clusters. T = number of repetitions.

16.3 Parametric (Generative) vs. Non-Parametric (Discriminative)

Along with the Bayesian framework comes also the distinction between parametric and non-parametric methods, as already implied above and as made in Section 12. Bishop uses the terms generative versus discriminative instead. Textbook chapters are often organized according to this distinction. See also wiki Linear classifier#Generative models vs. discriminative models.

The parametric, generative methods pursue the approximation of the density distributions p(x | ω_j) by functions with a few essential parameters. The prime example is the Naive Bayes classifier (Section 15). It is the approach preferred by theoreticians.

Non-parametric methods, in contrast, find approximations without any explicit models (and hence parameters), such as the kNN and the Parzen window. Here we summarize the typical assignment of the methods to those categories:

Parametric (Generative):
- Naive Bayes
- Expectation-Maximization (Section 12.2.1)
- (Maximum-Likelihood Estimation)
...in short: multi-variate methods

Semi-parametric:
- Clustering, i.e. K-Means
- Expectation-Maximization

Non-parametric (Discriminative):
- k Nearest Neighbor
- Support Vector Machines
- Decision Trees
- Neural Networks (NN & DNN)

I found the term 'semi-parametric' in Alpaydin's textbook. The Expectation-Maximization algorithm can be classified as parametric or non-parametric - depending on the exact viewpoint.


16.4 Algorithm-Independent Issues DHS p453, ch 9, pdf 531

The machine learning community tended to regard the most recently developed classifier methodology as a breakthrough in the quest for a (supposedly) superior classification method. However, after decades of research it has become clear (to most researchers) that no classifier model is absolutely better than any other one: each classifier has its advantages and disadvantages, and their underlying, individual theoretical motivations are all justified in principle. In order to find the best performing classifier for a given problem, a practitioner essentially has to test them all. Here are two issues that frequently occur in debates on pattern recognition:

Curse of Dimensionality Intuitively, one would think that the more dimensions (attributes) we have at our disposal (through measurements), the easier it is to separate the classes (with any classifier). However, one often finds that with an increasing number of dimensions it is more challenging to find the appropriate separability, which is also referred to as the curse of dimensionality. On the one hand, if there are irrelevant and possibly obstructive dimensions, it may indeed be better to reduce the dimensionality - as introduced with the principal component analysis for instance. On the other hand, the clever use of kernel functions, as in Support Vector Machines, shows that more parameters can also be useful.

No Free Lunch theorem DHS p454 The theorem essentially states that no classifier technique is superior to any other one. Virtually any powerful algorithm, whether it be kNN, artificial NN, unpruned decision trees, etc., can solve a problem decently if sufficient parameters are created for the problem at hand.


16.5 Check List

It is easy to forget some detail that can cost you a few percent of your performance - or even get you stuck. Here are the essentials in short form; a compact code sketch follows after the list of frequent failures below.

Preparation          Command               Comment
1. datatype          single(DAT)           Turn data into 'single' to save on RAM memory; 'double' if precision is required
2. NaN/Inf           isnan, isinf          Avoid inf; how many columns with NaN are there?
3. feature type                            numeric (real) or nominal (categorical)? or both?
4. permutation       randperm              Permute training set, in particular for SVM and DNN

Classification
5. normalize         i.e. zscore           In rare cases normalization is detrimental to performance (!).
6. group frequency   hist                  Groups (classes) equally distributed? If not, pay attention to class imbalance issues.
7. fold              crossvalind           5-fold cross validation is recommended
kNN                  fitcknn               Simple but memory intensive
Linear               fitcdiscr             Simple and straightforward - may require PCA; if performance not better than kNN, check points 2, 4, 5 and 7.
Tree(s)              fitctree, TreeBagger  Useful if variables are nominal (categorical)
SVM                  fitcsvm               Ideal for binary classification, but may require tuning; if tuning appears impossible, then use the linear classifier instead; if prediction is low (i.e. ≤ 80%), check the prediction also with the linear classifier.
Ensemble             individual            Worth testing if your data comes from different sources.

Frequent failures and their likely cause:
- The prediction accuracy is at chance level (e.g. 50% for binary classification, 33% for three classes, ...) or it is 100%:
  - Verify that the group labels are assigned properly to the samples.
  - Check for NaN and Inf again - see item 2 under Preparation.
- The prediction accuracy for training is suspiciously high: lack of permutation perhaps - try with permutation (see Section 2 or the code example in F.3).
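The following is a minimal sketch of how the steps above can be chained together, assuming a numeric data matrix DAT (n-by-d) and a numeric label vector Grp (n-by-1); crossvalind requires the Bioinformatics Toolbox:

DAT = single(DAT);                            % 1. datatype
DAT = DAT(:, ~any(isnan(DAT),1));             % 2. drop columns containing NaN
Prm = randperm(size(DAT,1));                  % 4. permute the samples
DAT = DAT(Prm,:);   Grp = Grp(Prm);
DAT = zscore(double(DAT));                    % 5. normalize
Fld = crossvalind('Kfold', size(DAT,1), 5);   % 7. 5-fold cross validation
Acc = zeros(5,1);
for f = 1:5
    Tst = (Fld==f);   Trn = ~Tst;
    Mdl = fitcdiscr(DAT(Trn,:), Grp(Trn));    % linear classifier
    Acc(f) = mean(predict(Mdl, DAT(Tst,:)) == Grp(Tst));
end
mean(Acc)                                     % average prediction accuracy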


17 Clustering III

The two clustering methods we had introduced before, in Sections 7 and 8, are representatives of five principal types of clustering methods: the K-Means algorithm belongs to the class of so-called partitioning methods; the hierarchical algorithms belong to the class of the same name. We elaborate on those two methods, but we also introduce the other three types of methods ThKo p629, 12.2, HKP p448, 10.1.3.

Partitioning methods: dynamically evolve spherical clusters. The K-Means algorithm presented in Section 7 is an example. Here we introduce variants that can deal with nominal data and that are less sensitive to outliers (Section 17.1).

Hierarchical methods: there exist variants of the procedures presented in Section 8 that can deal with large datasets, meaning they try to beat the O(N²) complexity. Those variants - i.e. CURE, ROCK, CHAMELEON, BIRCH - will be mentioned in the subsection dealing with large databases (Section 17.3).

Density-based methods: for those methods one specifies a minimum density value; the algorithm allows finding cluster shapes that are relatively arbitrary, as opposed to the shapes evolved with partitioning or hierarchical methods. Density-based algorithms have become increasingly popular in recent times, with DBSCAN probably being the most used algorithm; they will be introduced in Section 17.2.

High-dimensional approaches: the larger the number of dimensions, the higher are the chances that traditional clustering methods are not able to find the actual clusters. Section 17.4 explains why that is, and it mentions the approaches that can deal with high dimensionality.

Methods for Very-Large Databases (VLDB): if the data is of such large sample size (high N) that it does not fit into a computer's RAM anymore, then one typically applies sub-optimal but computationally fast methods for the sake of being able to find clusters at all. Those will be introduced in Section 17.3.

It may have become clear that the number of clustering algorithms is much larger than the number of classification algorithms. We merely attempt to give an overview in order to guide the reader toward the techniques they need, but we provide the (high-level language) code for some of the algorithms.

17.1 Partitioning Methods II

The classical K-Means algorithm enforces a cluster membership for each data point. This is also sometimes called hard clustering, because it seeks clear-cut boundaries. But groups or classes in data are rarely well separated and boundaries are rarely clear-cut. It can therefore sometimes be of advantage to regard boundaries as 'fuzzy': an example is given in Section 17.1.1. Hard clustering is particularly problematic if there exist outliers in the data. In that case taking the so-called medoid is better than taking the mean: Section 17.1.2 introduces an example.

17.1.1 Fuzzy C-Means (K-Means) ThKo p712, 14.3

The clusters that we have sought so far did not intersect: a sample (observation) was assigned to only one cluster, meaning clusters were assumed to have clear boundaries. This condition may be too strict for certain datasets, in particular when some samples show characteristics of two or more groups - and not only one. In other words, it is possible that groups could overlap: group boundaries can then be said to be 'fuzzy'. And this is exactly what fuzzy analysis deals with. There exists a variant of the K-Means method which performs a fuzzy K-Means clustering. For historical reasons, the parameter variable k is called c in that procedure, and it is therefore known as the Fuzzy C-Means algorithm.

A typical clustering algorithm returns a one-dimensional array with group labels (of length equal to the number of observations). The Fuzzy C-Means algorithm instead returns c arrays, where each array holds the proportion of membership for a group. Put differently, for each observation we have c values that express how likely the observation belongs to each cluster; the sum of those c values equals 1.


Usage in Matlab Matlab has a 'Fuzzy Logic' toolbox which contains an implementation of the Fuzzy C-Means algorithm, see the command fcm, which is applied the same way as kmeans. There is also a more explicit implementation with functions such as initfcm and stepfcm: one can obtain an idea of how to use those by looking at the function script irisfcm.m. In Appendix F.20 we give a compact example.
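As a minimal sketch (distinct from Appendix F.20; DAT is an n-by-d matrix, 3 clusters assumed):

[Ctr, U] = fcm(DAT, 3);       % Ctr: cluster centers, U: membership degrees (3-by-n)
[~, Lb]  = max(U, [], 1);     % hard labels: cluster with the largest membership
Lb = Lb';                     % one label per observation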

17.1.2 K-Medoids wiki K-medoids, ThKo p745, 14.5.2, HKP p454, 10.2.2

Because K-Means methods can be sensitive to outliers, it is sensible to also try medoids instead of means. The medoid is a representative point of the group; it is a point of the dataset and not a computed point (see wiki Medoid). The algorithm Partitioning-Around-Medoids (PAM) is the most common implementation of a K-Medoids approach. Not only can the algorithm deal better with outliers, it is in general also suitable for nominal data.

Advantages
- K-Medoids can also deal with discrete data - in addition to continuous data - whereas K-Means is suitable only for continuous data.
- K-Medoids is less sensitive to outliers.

Disadvantages The method is unfortunately computationally more demanding than K-Means: it has complexity O(k(N − k)²) and thus almost quadratic complexity.

Variants Because the K-Medoids algorithm has large complexity and therefore does not scale well to large datasets, there are some attempts to provide variants that work for large datasets as well. There exist two variants:

CLARA (Clustering LARge Applications): this method essentially corresponds to the PAM algorithm, but is applied on a randomly selected subset of the data. The algorithm is repeated several times in order to find all medoids.

CLARANS (Clustering Large Applications based upon RANdomized Search): this method is a modified algorithm of CLARA that improves its random subselection.

Usage in Matlab Matlab offers the function kmedoids, which is applied like the function kmeans. By default it performs the PAM algorithm. It has an option for running CLARA.
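A minimal sketch (DAT is an n-by-d matrix, 3 clusters assumed):

[Lb, Med] = kmedoids(DAT, 3);                    % PAM by default; Med holds the medoids
Lb2 = kmedoids(DAT, 3, 'Algorithm', 'clara');    % CLARA variant for larger datasets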

17.2 Density-Based Clustering (DBSCAN) ThKo p815, 15.9

HKP p471, 10.4

In the section on Density Estimation we introduced clustering methods that calculate a density value at each observation (Sections 12.1.2 and 12.2.1). Those algorithms are however computationally intensive, and the density-based method introduced now is simpler, as it carries out a quick, initial search for densities; hence the name DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

The DBSCAN algorithm searches for regions that show higher density in comparison to their neighborhood by applying a user-specified threshold: no density value is calculated, but merely a relational decision is made, for instance whether there are sufficient neighbors within the vicinity (Fig. 24). Thus, for that type of clustering algorithm, one specifies two parameters: the radius (size; range) of the neighborhood under investigation, called distance ε here; and the minimum number of points q that need to be present in that neighborhood.

Procedure The algorithm randomly selects a point, then calculates the distances to all other points and observes how many neighbors there are within distance ε, see the two examples in Fig. 24b. If there are fewer than q neighbors, then the point is considered 'noise' and the algorithm continues by selecting another point randomly. If there are ≥ q neighbors, then the point is considered 'core' and the algorithm now proceeds by analyzing only those neighboring points. By identifying a core point it is assumed that one has found a cluster and that by searching along its neighboring core points one can recover that cluster. This search along neighboring core points has the advantage that it does not impose any restrictions on the cluster shape: the shape can be arbitrary, and that aspect distinguishes the algorithm from the traditional partitioning and hierarchical algorithms. After a cluster has been 'collected' by such a search for neighboring core points, the algorithm returns to the random selection of individual points until one finds the next core point of a different cluster.

Figure 24: Some principles of the DBSCAN algorithm.
a. A dataset: a cluster of several points placed amidst some random points. The algorithm requires two parameters: a distance ε, which can be considered the radius of a circular neighborhood; and a minimum number of points q that are required to lie within that neighborhood.
b. Two example neighborhoods: neighborhood no. 1 - a gray-dashed circle with a diameter of 2ε - contains no neighbors, and that selected point is considered a 'noise' point; the neighborhood labeled 2 contains several neighbors, and if their cardinality is greater than or equal to q, then the point is considered a 'core' point.
c. Two clusters of different density: the points on the right make up a second cluster of lower density. If one however applied the same ε as in b, then all those points would be considered noise points and only the cluster on the left would be identified.

Complexity The complexity is O(N²) in principle, but it does not require the storage of the distance matrix. Thus the algorithm is slightly simpler than the hierarchical methods, because only the temporal complexity is O(N²). However, for low-dimensional data the temporal complexity can be O(N log(N)) by exploiting tree-type data structures, i.e. an R-tree; thus for low-dimensional data the method is suitable for large datasets.

Weaknesses The DBSCAN algorithm has the downside that it can only recover clusters of approximately similar densities. If clusters show very different densities, as illustrated in Fig. 24c, it can be difficult to detect them with one set of parameter values.


As one may have anticipated, the challenge for this algorithm is to find the appropriate values for the parameters ε and q. Different values of the parameters may lead to totally different results: one needs to search a two-dimensional space for the optimal values. One should select the values such that the algorithm detects the least dense clusters, that is, one needs to try out a range of values.

Usage in Matlab Appendix F.21 gives an example of the DBSCAN algorithm.
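For orientation, here is a compact sketch of the core DBSCAN loop (distinct from the code in Appendix F.21; the function name dbscan_sketch and the parameter names eps and q are illustrative). Newer Matlab releases also ship a built-in function dbscan.

function Lb = dbscan_sketch(DAT, eps, q)
% DAT: n-by-d data matrix; eps: neighborhood radius; q: minimum no. of neighbors
n  = size(DAT,1);
Lb = zeros(n,1);                    % 0 = unvisited
c  = 0;                             % current cluster label
for i = 1:n
    if Lb(i)~=0, continue; end
    Nb = regionQuery(DAT, i, eps);
    if numel(Nb) < q, Lb(i) = -1; continue; end   % 'noise' point (for now)
    c = c+1;  Lb(i) = c;            % found a core point -> new cluster
    S = Nb(:)';  k = 1;             % seed list to expand
    while k <= numel(S)
        j = S(k);
        if Lb(j) <= 0               % unvisited or noise: adopt into the cluster
            if Lb(j) == 0
                Nb2 = regionQuery(DAT, j, eps);
                if numel(Nb2) >= q, S = [S, Nb2(:)']; end   % j is a core point
            end
            Lb(j) = c;
        end
        k = k+1;
    end
end
end

function Nb = regionQuery(DAT, i, eps)
% indices of all points within distance eps of point i (Euclidean)
Dis = sqrt(sum(bsxfun(@minus, DAT, DAT(i,:)).^2, 2));
Nb  = find(Dis <= eps);
end

Remaining noise points keep the label -1 in this sketch.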

17.2.1 Variants

There exist variants of the DBSCAN algorithm that try to address its weaknesses:

- DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) ThKo p818, 15.9.2: this version is supposed to be able to deal with clusters of varying density and it requires no parameter definition: it should therefore be able to deal with the two-cluster situation as in Fig. 24c. Its runtime is however twice as long as that of the DBSCAN algorithm, but 60 times faster than the CLARANS algorithm (the K-Medoids variant in Section 17.1.2).

- DENCLUE (DENsity-based CLUstEring) HKP p476, 10.4.3, ThKo p819, 15.9.3: this version is less affected by data dimensionality, meaning it has also been used for high-dimensional data such as in Bioinformatics.

- OPTICS (Ordering Points To Identify the Clustering Structure) HKP p473, 10.4.2: this algorithm does not cluster the points but merely orders them. It has the same complexity as DBSCAN.

17.2.2 Recapitulation

We summarize the advantages and disadvantages of density-based methods in general:

Advantages
- The method has the ability to recover arbitrarily shaped clusters.
- It is able to handle outliers efficiently.
- If low-dimensional data is employed, then the complexity can be only O(N logN).
- It is particularly suitable for spatial data.

Disadvantages
- DBSCAN has two parameters, meaning we search in a two-dimensional space for the optimal state.
- DBSCAN can only detect clusters of similar density. The variant called DBCLASD is however capable of recovering clusters of varying densities.
- The complexity for multi-dimensional data is O(N²). There exists however a variant that supposedly has only O(N) (search for HIERDENC).
- Clusters are not as easily interpretable as in other methods.
- The method is not well suited for high-dimensional data as it involves distance measurements. However, the variant called DENCLUE can deal with high-dimensional data.

17.3 Very Large Data Bases (VLDB)

There exist three principal methods to achieve clustering in very large databases (VLDB) in reasonable time. Those methods are not really anything novel from a conceptual viewpoint, but rather represent pragmatic approaches combining different techniques:

- Incremental Mining: these are one-pass algorithms that iterate through the data points once. This method can also be called on-line clustering: it handles one data point at a time, and then discards it. The algorithm DIGNET is an example. It uses a K-Means cluster representation without iterative optimization: centroids are instead pushed or pulled depending on whether they lose or win each incoming point. The method strongly depends on data ordering and can result in poor-quality clusters. However, it handles outliers, clusters can be dynamically created or discarded, and the training process is resumable. This makes it very appealing for dynamic VLDB.
(If I am not mistaken, then this method corresponds to the sequential algorithms in ThKo p633, 12.2)

- Data Squashing: this method firstly generates statistical summaries of the data, thus reducing the data - either its sample size or its dimensionality. Following this reduction step, one uses a conventional clustering algorithm. A famous example is the BIRCH algorithm (Section 17.3.2), which starts by creating 'local' clusters first. Other examples are the so-called grid-based methods, which we will treat under algorithms for high-dimensional data (Section 17.4).

- Reliable Sampling: a subset of the samples is selected before a conventional clustering algorithm is used. The selection of a representative subset is of course the challenging issue. The algorithm CLARA is an example of this category (Section 17.1.2), as is the algorithm CURE (see below).

In the following we discuss some variants of hierarchical methods (Section 17.3.1) and then discuss the BIRCH algorithm in more detail (Section 17.3.2).

17.3.1 Hierarchical ThKo p682, 13.5

Hierarchical clustering is by its nature not really suitable for very large datasets, in particular due to the pairwise measurements between observations (Section 8.1). If one insists on using hierarchical clustering, then there exists a number of hierarchical algorithms that are geared toward dealing with large datasets. We merely summarize those:

Name                Comments
CURE (sampling)     Clustering Using REpresentatives: a cluster is represented by at least 2 points.
                    - suitable for numerical features (particularly low-dimensional spatial data)
                    + low sensitivity to outliers due to shrinking
                    + can reveal clusters with non-spherical or elongated shapes, as well as clusters of wide variance in size
                    - efficient implementation of the algorithm is possible using the heap and the k-d tree data structures
ROCK                RObust Clustering using linKs: uses links for merging clusters in place of distances
                    - suitable for nominal (categorical) features
CHAMELEON           HKP p466, 10.3.4; ThKo p686
                    - capable of recovering clusters of various sizes, shapes and densities in 2D data
BIRCH (squashing)   local clustering using hierarchical linking, followed by conventional clustering (Section 17.3.2)

Those algorithms are typically implemented in a lower-level language (C or one of its derivatives), where there exists more flexibility than in high-level languages such as Matlab. Those implementations may place part of the interim results on the hard drive to cope with memory problems.

17.3.2 BIRCH wiki BIRCH HKP p462, 10.3.3

The BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies) combines a variety of techniques and consists of two principal phases. The first one carries out a local clustering using hierarchical linking, and the resulting clusters are summarized as abstractions. The second phase then applies any type of conventional clustering method to the abstracted clusters.

Phase I: A so-called clustering-feature tree is generated. A clustering feature (CF) is an object that summarizes a local group of points. For instance, for nloc closely spaced observations x_i, their centroid and their 'within-cluster' variance are determined. In order to build the CF-tree efficiently, not the actual centroid and variance are stored, but instead the linear sum LS = Σ_{i}^{nloc} x_i and the squared sum SS = Σ_{i}^{nloc} x_i², respectively. In addition, the parameter nloc is stored as well. The CF is thus an object consisting of one scalar and two vectors of length ndim:

CF = {nloc, LS, SS}        (16)

By storing only the two vectors LS and SS one saves memory, namely nloc − 2 vectors are omitted from storage. To merge two clusters, one simply adds the two parameter values nloc and the respective components of the vectors LS and SS:

CF_1 + CF_2 = 〈nloc,1 + nloc,2, LS_1 + LS_2, SS_1 + SS_2〉.        (17)

Example: Suppose we have a two-dimensional clustering challenge and the values of two CFs are as follows: CF_1 = {3, (9, 10), (29, 38)} and CF_2 = {3, (35, 36), (417, 440)}. If we wish to join the two clusters, then we form the new CF as follows: CF_s = {3+3, (9+35, 10+36), (29+417, 38+440)} = {6, (44, 46), (446, 478)}.

The generation of a CF-tree is controlled by two parameters, which implicitly control the height of the tree:
- Branching factor B: specifies the maximum number of children per non-leaf node.
- Threshold T: specifies the maximum diameter of the sub-clusters stored at the leaf nodes of the tree. The diameter can be computed from LS and SS as √( (2n·SS − 2·LS²) / (n(n−1)) ).
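A small sketch of the CF arithmetic (Equations 16 and 17 and the diameter test), using a Matlab struct with illustrative field names:

cf1 = struct('n',3, 'LS',[9 10],  'SS',[29 38]);
cf2 = struct('n',3, 'LS',[35 36], 'SS',[417 440]);
% merging two CFs (Equation 17): add the counts and the component-wise sums
cfm = struct('n',cf1.n+cf2.n, 'LS',cf1.LS+cf2.LS, 'SS',cf1.SS+cf2.SS);
% diameter of the merged sub-cluster, to be compared against threshold T
n = cfm.n;
D = sqrt( (2*n*sum(cfm.SS) - 2*sum(cfm.LS.^2)) / (n*(n-1)) );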

The CF-tree is built dynamically as objects are inserted. Thus, the method is also an example of incremental mining and not only of data squashing (see the introduction of this Section 17.3 again). An object is inserted into the closest leaf entry (sub-cluster) and the diameter is recomputed: if the new diameter exceeds the threshold T, then the leaf node and possibly other nodes are split. After the insertion of the new object, information about the object is passed toward the root of the tree. If the CF-tree grows beyond the RAM's size, then the parameter value T is increased and the CF-tree is rebuilt.

Phase II: A conventional clustering algorithm is applied to the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

Disadvantages:
- favors spherical cluster shapes due to the use of the diameter parameter
- the CF-tree can simply be an inappropriate summary for the final clustering result
- there are two parameters (B and T) to tune

Implementation: I am aware of the following implementation in R: http://www.inside-r.org/packages/cran/birch/docs/birch

17.4 High-Dimensional Data ThKo p821, 15.10

HKP p508, 11.2

The larger the dimensionality, the less significant distance measurements between samples become - they probably become useless. This is sometimes loosely called the curse of dimensionality. The exact dimensionality from which on data can be considered high-dimensional ranges between 11 and 20, depending on the expert's viewpoint. For dimensionality larger than those limits, the Euclidean space has grown so large that the distance between any two points becomes almost the same and thus practically indistinguishable. High-dimensional data are therefore better reduced in dimensionality if one intends to use metric distance measurements. One way to reduce the dimensionality is of course by use of the methods introduced in Section 5 already, reviewed in the following subsection (Section 17.4.1).

An alternative approach to clustering high-dimensional data is to directly look for the subspaces where the clusters reside. This idea appears not to make use of the full dimensionality and it could imply a loss of information. But because for higher dimensionality the space is so 'vast', samples naturally occur only in subspaces. To illustrate this reasonable assumption, think of only a 10-dimensional space: if you assumed that there existed at least one point in each bin of a 10-bin histogram for the full dimensionality - namely a 10-dimensional 10-bin histogram - then there would have to exist at least 10^10 samples. This is an unlikely scenario (so far) and that is why the search for subspaces is a meaningful approach (Section 17.4.2).


Figure 25: Cluster situations and their suitability for some algorithms.
a. Clusters C1 and C2 occur within two intervals of the y-axis, whereas the values along the x-axis are roughly equally distributed. In this case, both feature selection and subspace clustering would make sense.
b. Different clusters reside in different subspaces of the original feature space. This situation arises in particular for high-dimensional data and is therefore a clear case for subspace clustering, whereas feature selection is far less suitable.

17.4.1 Dimensionality Reduction

The advantages and disadvantages of the dimensionality reduction techniques are analogous to the ones encountered for classification (Section 5):

Feature generation: the most common techniques are the PCA (principal component analysis) as well as the SVD (singular value decomposition). Those techniques are useful when a significant number of features contributes to the formation of clusters. The danger is that such generation may distort the clusters present in the original space; or that certain important dimensions are omitted.

Feature selection: this technique is useful when all clusters lie in the same subspace of the feature space. Figure 25a shows an example, where dimension x can be eliminated without any loss: the clusters can be identified by histogramming along the y-dimension only.

17.4.2 Subspace Clustering HKP p510, 11.2.2

These are algorithms that search for clusters in any of the subspaces of the entire feature space. Because this is a combinatorial problem, the challenge is to develop algorithms that find the most relevant subspaces in reasonable time. There exist two principal techniques toward that goal. One focuses on generating histograms, which are called grids in this context, sketched in the paragraph 'grid-based' next. The other focuses on points, see the paragraph 'point-based' below.

Grid-based: HKP p479, 10.5 this approach corresponds essentially to finding clusters in multi-dimensional histograms (Section 12.1.1). Because it is infeasible to generate the multi-dimensional histogram for the entire space, the strategy is to work from low to high dimensionality. One starts by generating a one-dimensional histogram for each individual feature and by identifying 'densities' in those histograms. Those one-dimensional densities are then used to detect unions of densities in two dimensions, which in turn are used to detect densities in higher dimensionality. Figure 25b shows an example where this approach would be successful. We mention two implementations next, and switch now to the preferred terms grid, unit and edge size (instead of using the terms histogram, bin and bin size, respectively).

CLIQUE (CLustering In QUEst): ThKo p825 A user specifies two parameters, an edge size ξ and a density threshold τ; the density is determined as the unit count divided by the total number of samples. Firstly, a grid with edge size ξ is applied to each individual feature and those units with a density larger than τ are stored. For these dense units, one then determines their unions in multiple dimensions and finds those unions that lie adjacent. This leads to clusters that are typically of much smaller dimensionality than the full space.

Advantages
- insensitive to the order of the data
- does not impose any distribution or shape on the data
- scales linearly with sample size

Disadvantages
- scales exponentially with dimensionality
- parameters are not obvious to select
- accuracy of cluster shapes can be coarse, due to the use of grids only
- large overlap of clusters due to the use of unions
- risk of losing small but meaningful clusters after the pruning of subspaces based on their coverage

MAFIA (Merging of Adaptive Finite IntervAls): is a variant of CLIQUE where the edge size ξ is variable. It performs somewhat better for increasing sample size than the other grid-based algorithms.

Point-based: in this approach one first selects potentially representative points and then starts to grow clusters. The resulting clusters are outlined more accurately than those obtained with grid-based methods.

PROCLUS: ThKo p832 this algorithm borrows concepts from the K-Medoids algorithm (Section 17.1.2). The user specifies two parameters: a number of m clusters as well as an average dimensionality s.

ORCLUS: this algorithm is agglomerative in nature, as introduced for the hierarchical algorithms (Section 8). The user again specifies a number of m clusters and a maximal dimensionality s.


18 Clustering: Rounding the Picture

We round off the subject of clustering by first providing a summary of the most common clustering procedures (Section 18.1). Typically one assumes that the data do possess some type of clusters. But if we are uncertain whether the data actually contain clusters at all, then one should attempt to verify that: Section 18.2 discusses that problem. Section 18.3 provides a check list to avoid some common mistakes.

18.1 Summary of Algorithms

                        Complexity     Robustness   Shape         #Prm      Type     Order
Partitioning
K-Means                 O(Nkt)                      hyper-ellip.  1                  dep
K-Medoids               < O(N^2)       outliers                   1         nominal  dep
  CLARA                 rand. sel.                                1
Fuzzy C-Means           O(Nkt)         noise                      1                  dep

Hierarchical
Single Linkage          O(N^2)                      elongated     1
Complete Linkage        O(N^2 logN)                 compact       1

Density
DBSCAN                  <= O(N^2)      out./noise   arbitrary     2 (ε, q)
OPTICS                  O(N logN)      out./noise   arbitrary

Very Large DB
BIRCH (hier.)           O(N)           outliers     compact       2 (B, T)

High-Dimensional
CLIQUE (grid-based)     O(N)                        arbitrary     2 (ξ, τ)
PROCLUS (point-based)   O(N)                                      2 (m, s)

Table 4: Summary of popular clustering algorithms.
Complexity: N = number of observations (data points); rand. sel. = random subselection.
Robustness: out. = outliers. Shape: cluster shapes. #Prm: number of parameters to tune.
Type: preferred data type. Order: order of observations: dep = dependent on order.

To further understand the differences between the principal clustering techniques (partitioning, hierarchical and density), we look at three example datasets in Figure 26. In a we have a ring of points that is approximately two points wide, sitting amongst some outliers (or noise). Such a cluster is best detected with a density-based method, as this method allows detecting arbitrary shapes; or with a single-linkage hierarchical method, in which case it would zig-zag through the cluster but form a ring nevertheless.

In b there is an S-shaped cluster placed in a random set of points. As the cluster forms a sequence of points, the single-linkage method appears to be the most appropriate choice - though the density-based method could also detect the cluster.

In c there are several clusters sitting in noise. A K-Means algorithm would probably perform well, but it would include the noise points (or outlier points) in the computation of the cluster centers, though one could of course also try K-Medoids or Fuzzy C-Means. Similarly, a complete-linkage hierarchical method would also find the cluster centers relatively well. A density-based method would probably find the cluster centers more accurately, as it excludes the noise points - but only if an appropriate density threshold is specified. The risk with the density-based method is that, if the threshold is not specified precisely, it would fuse certain clusters: for instance, in the lower left of the image center there are two clusters that appear linked by a sequence of three to four points.


Figure 26: Three artificial two-dimensional data sets that illustrate some of the abilities of the principal clustering methods. a. A ring of points: density-based appears most appropriate. b. A sequence of points: single-linkage appears optimal. c. A set of clusters: K-Means or density-based. (See text for more explanations.)

18.2 Clustering Tendency ThKo p896, 16.6

HKP p484, 10.6.1

What comes last now should in principle be the first step in a cluster analysis, namely a test that verifies whether the data contain true clusters at all. A clustering algorithm will always find some structure in the data, even if the data is a set of random points. Thus, one should make an attempt to find out whether the data tend to cluster at all, before one draws strong conclusions about the data. This early test is called clustering tendency and is typically carried out with statistical tests. And one should not only test for randomness, but also for regularity; hence there are three hypotheses to verify:

- Randomness hypothesis: the data are randomly distributed. This is typically the null hypothesis H0.
- Regularity hypothesis: the points are regularly spaced - they are not too close to each other.
- Clustering hypothesis: the data form clusters.

For two dimensions there are some tests, but for more dimensions there does not exist a generally convincing test: each one has its advantages and disadvantages. For low dimensionality one may simply visually display the data - which we recommend doing anyway; for higher dimensionality there exists the problem that we do not know the so-called sampling window exactly. Roughly speaking, the sampling window is the range of the data, but there exist different exact definitions. The problem is: if the window is chosen too large, then the data itself is interpreted as a single cluster and that would favor the clustering hypothesis.

Intuitively, one would like to analyze the distances between points: for example, we analyze the distances between the data points themselves and observe their MST (minimal spanning tree); or one analyzes the distances between a set of randomly generated points and the set of data points; Section 18.2.1 gives an example of the latter. But as mentioned before, there is no generally accepted test that can confirm any hypothesis with large certainty.

In conclusion, one is left with the same type of advice that exists for any statistical test: one needs to interpret the clustering results carefully and not rush to generalizations. More specifically, the presented clustering techniques are only tools to arrive at a careful interpretation of the data.

18.2.1 Test for Spatial Randomness - Hopkins Test ThKo p901

In this test for spatial randomness, we measure nearest-neighbor (NN) distances, once for a set of samples of the entire dataset, S ⊂ D, and once for a generated set of random points, R. For each sample of S we measure the NN distance d_i to the other samples in S, raise it to the power of the dimensionality l, (d_i)^l, and sum those distances: d̂_own = Σ_{i}^{nSamples} (d_i)^l. Analogously, for each sample of R we measure the NN distance d_i to the samples in S: d̂_rs = Σ_{i}^{nSamples} (d_i)^l. Then, the following measure is formed:

h = d̂_rs / (d̂_rs + d̂_own)        (18)


If the pattern is a set of random points, then d̂_own and d̂_rs will be about the same size, and h then has a value of around 0.5. If the pattern is a set of regularly spaced points, then d̂_own will be larger and the h-value will be smaller than 0.5. If the pattern contains clusters, then d̂_own will be smaller, resulting in an h-value larger than 0.5.

This test is only reliable if the set of random points R lies exactly within the range of the values in S. Otherwise, the h-values cannot be compared meaningfully. Appendix F.22 shows an example of how to apply this test on artificial data.
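The following is a compact sketch of the Hopkins test (Equation 18), distinct from Appendix F.22; DAT is assumed to be an n-by-d matrix scaled to the range [0, 1], so that rand generates points within the sampling window:

[n, d] = size(DAT);
nS  = round(0.1*n);                               % sample size (assumed 10%)
Six = randsample(n, nS);                          % indices of the sample S
S   = DAT(Six,:);
R   = rand(nS, d);                                % random points in [0,1]^d
dOwn = 0;  dRs = 0;
for i = 1:nS
    Oth  = DAT(setdiff(Six, Six(i)), :);          % the other samples of S
    Dis  = sqrt(sum(bsxfun(@minus, Oth, S(i,:)).^2, 2));
    dOwn = dOwn + min(Dis)^d;                     % NN distance within S
    Dis  = sqrt(sum(bsxfun(@minus, S, R(i,:)).^2, 2));
    dRs  = dRs + min(Dis)^d;                      % NN distance from R to S
end
h = dRs / (dRs + dOwn)                            % ~0.5 random, >0.5 clusters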

18.3 Check List

When starting a cluster analysis, the first two points to consider are the data size and the dimensionality:

- Data size: if the data fits into a computer's RAM and is not high-dimensional, then we can proceed with the check list given below; otherwise we need to approach the data using only algorithms suitable for large datasets, see Section 17.3.
- Dimensionality: if the data is high-dimensional - from 11 dimensions on - then one should consider the techniques introduced in Section 17.4.

Otherwise, one may proceed with the check list as given here:

Preparation          Command            Comment
1. datatype          single(DAT)        Turn data into 'single' to save on RAM memory; 'double' if precision is required
2. NaN/Inf           isnan, isinf       Avoid inf; how many columns with NaN are there?
3. feature type                         numeric (real) or nominal (categorical)? or both?
4. permutation       randperm           Permute training set, in particular when the algorithm depends on order
5. tendency?                            Are data NOT random? (Section 18.2)

Clustering
(no) normalization                      Try without first. Chances are reasonable it does not help.
K-Means              kmeans             Always try it - with different ks. It is quick and fairly powerful. For large datasets use:
                                        Opt = statset('MaxIter',500,'Display','iter');
                                        Lb = kmeans(DAT,5, 'replicates',3, 'onlinephase','off', 'options',Opt);
Hierarchical         pdist, linkage,    Watch your memory - it will take pairwise distances (O(N^2))
                     cluster

Table 5: Check list for clustering (for data that fits into a computer's RAM and is not high-dimensional).
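For completeness, a minimal sketch of the hierarchical commands listed in the table (DAT is an n-by-d matrix; 5 clusters assumed):

Dis = pdist(DAT);                   % pairwise distances, O(N^2) memory
Lnk = linkage(Dis, 'complete');     % complete linkage
Lb  = cluster(Lnk, 'maxclust', 5);  % cut the tree into 5 clusters
% dendrogram(Lnk)                   % optional visual inspection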


A Distance and Similarity Measures ThKo p602, 11.2

HKP p65, 2.4

The 'spacing' or 'separation' between two points can be expressed as a distance or as a similarity. The notion of distance is perhaps more intuitive at first, because we were taught the Euclidean distance in school. Similarity can be roughly explained as the inverse of a distance measure and will become equally intuitive after we have dealt with certain algorithms.

How Matlab implements the following measures exactly is detailed on their help page Classification Using Nearest Neighbors. Or see the function pdist.

A.1 Distance Measures wiki Distance

• Minkowski The most used distance measure, namely the Euclidean distance, is merely one of several useful distance measures. The Euclidean distance, as well as some other measures, can be expressed by a single formula, namely the Minkowski metric, which is also referred to as the L_k norm:

L_k(a, b) = ( Σ_{i=1}^{d} |a_i − b_i|^k )^{1/k}        (19)

a and b are two vectors of dimensionality d. For the following values of k the distance is also known as: DHS p187

k    Norm       Name(s)                                       Matlab
1    L1 norm    Manhattan / city-block / taxi-cab distance    mandist
2    L2 norm    Euclidean distance                            dist
∞    L∞ norm    Chebyshev distance                            in pdist

The Manhattan distance has the benefit that it is calculated faster than the other metrics, as it measures only the sum of absolute distances - the power and root operations fall away. The Euclidean metric is a relatively costly measure, because it squares and takes the square root; for that reason the Euclidean metric is sometimes computed without taking the root - then called squared Euclidean - if the actual (Euclidean) distance value is not necessary.

In algorithms you often need to take the distance between one observation DAT(i,:) and all others DAT. Here is how you would calculate those distances:

Dis = sum(abs(bsxfun(@minus,DAT,DAT(i,:))), 2); % city-block

Dis = sqrt(sum(bsxfun(@minus,DAT,DAT(i,:)).^2, 2)); % Euclidean

Dis = sum(bsxfun(@minus,DAT,DAT(i,:)).^2, 2); % squared Euclidean

• Mahalanobis This is another popular distance measure. It uses a covariance matrix S to arrive at a distance value; see also Section 4.1 for the covariance matrix and Appendix D for notation:

D_M(x) = √( (x − µ)^T S^{-1} (x − µ) )        (20)

where x is a sample and µ is a mean vector; µ often represents the average over the samples of a class. The Mahalanobis measure is used in the Naive Bayes classifier for instance (Section 15). In the special case where the covariance matrix is the identity matrix (only 1s along the diagonal, 0 elsewhere), the Mahalanobis distance reduces to the Euclidean distance (L2 norm above). In Matlab: mahal.
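A minimal sketch with assumed variable names: the Mahalanobis distance of each row of DAT to the mean of a reference group GRP (an n-by-d matrix):

Mu  = mean(GRP, 1);                     % group mean (1-by-d)
S   = cov(GRP);                         % covariance matrix (d-by-d)
Dif = bsxfun(@minus, DAT, Mu);          % differences to the mean
Dis = sqrt(sum((Dif / S) .* Dif, 2));   % (x-mu) * inv(S) * (x-mu)' per row
% with the Statistics Toolbox: Dis = sqrt(mahal(DAT, GRP))  (mahal returns squared distances)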

• Hamming Is a distance that is suitable for discrete-valued data. It is defined as the number of elements where two vectors differ. I found different exact definitions; here is one implementation (normalized by the number of variables):

Dis = sum(bsxfun(@ne,DAT,DAT(i,:)), 2) / nDim; % nDim = no. of variables


Discrete-Valued Vectors Use the L1 distance (city-block; Manhattan) or the just mentioned Hamming distance.

A.2 Similarity Measures wiki Similarity measure

Similarity measures are particularly used for clustering and for Support Vector Machines. In a similarity measure, the measure has the highest value when two vectors are identical - often defined to be equal to one; the measure drops the more the two vectors differ from each other, often approaching zero for very distant vectors.

• Dot Product (Linear) exactly as in Equation 25 (Appendix D.1). It is typically applied to normalized data, in which case taking the similarity values would simply be:

Dis = 1 - (DAT * DAT(i,:)’); % assumes data are normalized!

• Cosine Similarity Is the dot product of the two vectors divided by the product of their lengths. When this measure is applied, then typically the data are normalized to have unit length for each observation, such that the divisor becomes one, in which case the similarity corresponds to the dot product. The normalization can be done as follows:

Dnorm = sqrt(sum(DAT.^2, 2)); % length of each observation vector

DAT = DAT ./ Dnorm(:,ones(1,nDim));

Taking the similarity values is done as for the dot product above.

• Radial-Basis Function (RBF) Is typically understood as the Gaussian function (Equation 21; Appendix B) but can be any other symmetric function that decreases for increasing difference in input.
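For example, the Gaussian version with an assumed width sigma:

Sim = exp(-sum(bsxfun(@minus,DAT,DAT(i,:)).^2, 2) / (2*sigma^2)); % Gaussian RBF similarity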

• Pearson's Correlation Coefficient As in statistics: wiki Pearson product-moment correlation coefficient. Values range from -1 to 1. It is used in the Support Vector Machine for instance.

DAT = bsxfun(@minus, DAT, mean(DAT,2)); % difference to mean

Dnorm = sqrt(sum(DAT.^2, 2)); % length of each difference vector

DAT = DAT ./ Dnorm(:,ones(1,nDim)); % E [-1 1]

Taking the similarity values is done as for the dot product above.

• Jaccard/Tanimoto Index wiki Jaccard index. There exist different definitions. Here is an example:

Bnoz = bsxfun(@or,(DAT~=0),(DAT(i,:)~=0)); % pairs with no zeros

Bdff = bsxfun(@ne,DAT,DAT(i,:)); % pairs that are different

Dis = sum(Bdff & Bnoz, 2) ./ sum(Bnoz, 2);

Discrete-Valued Vectors Use the Jaccard/Tanimoto index.


B Gaussian Function wiki Gaussian function

The Gaussian function is a function whose shape looks like a bell. It is often used for approximating probability distributions and for similarity measures. It has two parameter values: its mean µ, which corresponds to the center of the distribution, and its standard deviation σ, which describes its width. The one-dimensional function is

g(x; µ, σ) = 1/(σ√(2π)) · exp( −(1/2) [(x − µ)/σ]² )        (21)

In Matlab we can call the function normpdf to calculate it; for instance, here we generate it for x ranging from -4 to 4, with center µ = 0 and a standard deviation σ = 1:

normpdf(-4:0.1:4, 0, 1)

In two (or more) dimensions: The Gaussian function also exists in two or more dimensions, in which case there are more µs and σs, namely a pair for each dimension. In two dimensions with axes x and y we can write

g(x, y) = 1/(2π σ_x σ_y √(1 − ρ²)) · exp( −1/(2(1 − ρ²)) · ( [(x − µ_x)/σ_x]² + [(y − µ_y)/σ_y]² − 2ρ(x − µ_x)(y − µ_y)/(σ_x σ_y) ) )        (22)

where µ_x and µ_y are the coordinates of the location; σ_x and σ_y are the two widths of the Gaussian. In addition there is a parameter ρ that expresses the correlation between X and Y. The standard deviations are positive, σ_x > 0 and σ_y > 0, but ρ can also be negative.

The equation is typically written more compactly, namely in matrix notation. Toward that, one forms a vector of the mean parameters and a matrix of the variance parameters:

µ = [µ_x; µ_y]   and   Σ = [σ_x²  ρσ_xσ_y;  ρσ_xσ_y  σ_y²]

Σ is the covariance matrix as introduced in Section 4.1. Now we can write the Gaussian as follows DHS p33

g(x) = 1/((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)^t Σ^{-1} (x − µ) )        (23)

which is also the formula for the multivariate Gaussian function wiki Multivariate normal distribution, the Gaussian function for two or more dimensions. This formula is still rather large and is therefore often abbreviated as

N(µ,Σ). (24)

whereby N stands for normal, which is another name for the Gaussian function.

There are three noteworthy terms in Equation 23:
|Σ| is the so-called determinant of the covariance matrix
Σ^{-1} is the inverse of the covariance matrix
(x − µ)^t Σ^{-1} (x − µ) is the squared Mahalanobis distance (see Equation 20 above)

The determinant and the inverse are algebraic operations which are beyond the scope of this course. The Mahalanobis distance is obtained by matrix multiplications.

In Matlab the multivariate Gaussian function is implemented with the command mvnpdf. Here is a two-dimensional example of how to create it:

X = -4 :.1 : 4;

nP = length(X);

Mu = [0 0];

SGM = [.1 1.6];

[MX MY] = meshgrid(X,X);

Dlin = mvnpdf([MX(:) MY(:)], Mu, SGM);

Dimg = reshape(Dlin,nP,[]);


C Varia

C.1 Programming Hints for Matlab

Speed To write fast-running code in Matlab, one should exploit Matlab's matrix-manipulating commands in order to avoid the costly for loops, see for instance bsxfun, repmat or accumarray. Writing a kNN classifier can be conveniently done using the repmat command. However, when dealing with high dimensionality and a large number of samples, exploiting this command can in fact slow down computation, because the machine will spend a significant amount of time allocating the required memory for the large matrices. In that case, the code runs faster if you balance for-loops with memory-allocating commands, i.e. maintain a single for loop and use repmat for the remaining operations. Unfortunately, it is difficult to anticipate the appropriate balance: one has to try out different combinations to arrive at the fastest implementation.

Vector Multiplication In mathematical notation a vector is assumed to be a column vector (see also Appendix D). In Matlab, however, if you define a vector as a=[1 2 3], it is a row vector - exactly as you typed it. To conform with mathematical notation, either transpose the vector immediately using the transpose sign ' (e.g. a=[1 2 3]';) or enter it with semi-colons (e.g. a=[1; 2; 3];); otherwise you are forced to move the transpose sign later when applying the dot product (a*b' instead of a'*b), in which case it appears reversed with respect to the mathematical notation. Or simply use the command dot, for which the column/row orientation is irrelevant.
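A small sketch of the above (the values are arbitrary):

a = [1 2 3]';    % column vector, transposed immediately
b = [4; 5; 6];   % column vector, entered with semi-colons
s1 = a' * b;     % dot product, written as in the mathematical notation
s2 = dot(a, b);  % same result - here the orientation does not matter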

C.2 Parallel Computing Toolbox in Matlab

Should you be the lucky owner of the Parallel Computing Toolbox in Matlab, then you can even use it on your home PC or laptop, as nowadays home PCs have multiple cores, which permits parallel computing in principle. It is relatively simple to exploit the parallel computing features for for-loops that are suitable for parallel processing: open a pool of workers, carry out the loop using the parfor command and then close the pool again.

pool = parpool(2); % open a pool with two workers (older releases used: matlabpool local 2)
parfor i = 1:1000
A(i) = SomeFunction(Dat, i); % the data are manipulated in some function by counter i
end
delete(pool); % close the pool again (older releases used: matlabpool close)

The parfor loop cannot be used if your computations in the loop depend on previous results, for example in an iterative process where A(i) depends on A(i-1). It also only makes sense if the process that is supposed to be repeated in parallel is computationally intensive; otherwise the assignment of the individual steps to the corresponding cores (workers) may slow down the computation.


D Matrices (& Vectors): Multiplication and Special Matrices

There are several types of multiplications for vectors and matrices. We here summarize only the most frequently used ones. First we need to distinguish between the ’orientation’ of vectors, namely between row and column vectors (see again Fig. 4):

Row vector: ’horizontal’ sequence of numbers, e.g. A = [1 5 3 −2]. In Matlab entered as follows: A = [1 5 3 -2];, that is, exactly as written in mathematical notation.

Column vector: ’vertical’ sequence of numbers, e.g.

B = \begin{bmatrix} -1 \\ 4 \\ 2 \end{bmatrix}

Because such a notation is space consuming, one can also write the column vector as a row vector with an indication at the end of the brackets telling us that it is supposed to be a column vector. That indication is the transpose sign or the letter T: B = [−1 4 2]′ or B = [−1 4 2]^T. In Matlab we can write B = [-1 4 2]’; - note the transpose sign. But we can also enter the values with semi-colons, e.g. B = [-1; 4; 2], and leave the transpose sign away.

If you have trouble remembering the two orientations, then think of ’row of seats’ (horizontal) and ’columns of a temple’ (vertical).

Note: In mathematical notation, a vector is assumed to be a column vector by default. It is thus recommended that vectors in Matlab are defined as column vectors immediately, such that multiplications in the code appear in accordance with the mathematical notation - otherwise it can become truly confusing.

D.1 Dot Product (Vector Multiplication) wiki Dot product

In this case, the orientation of the vectors (row or column) does not matter. Given two vectors of equal length, A = [A_1, A_2, ..., A_n] and B = [B_1, B_2, ..., B_n], the dot product is defined as the summation of their element-wise products:

A \cdot B = \sum_{i=1}^{n} A_i B_i = A_1 B_1 + A_2 B_2 + \ldots + A_n B_n = A'B    (25)

where the left side (A · B) is the notation using the dot ·, the center uses the summation notation Σ, and the right side (A'B) is the matrix notation using the transpose. This is also known as the scalar product, because the result is a single number. In Matlab you can use the command dot to obtain the product, in which case the order and orientation of the vectors do not matter. The dot product can also be regarded as a special case of the matrix multiplication (coming up next), in which case the orientation of the vectors does matter.

D.2 Matrix Multiplication wiki Matrix multiplication

An n×m matrix A consists of n rows and m columns. It is fairly intuitive that if you add or subtract a scalar value from a matrix, or multiply or divide a matrix by a scalar, this is done for each element of the matrix. It is also intuitive that if two matrices are of the exact same size, then you can perform these operations with corresponding elements. In Matlab one uses .* and ./ to specify those element-wise operations - if you forget the dot, you may obtain completely wrong results.

It is less intuitive, however, how the operations are carried out when we multiply two matrices of different sizes with each other. Such a matrix product requires that the number of columns of the first matrix equals the number of rows of the second matrix: if A is an n×m matrix and B is an m×p matrix, then their matrix product AB is an n×p matrix, in which the m entries across the rows of A are multiplied with the m entries down the columns of B. Remember the expression nm·mp → np to memorize that requirement. Let us look at the special cases when m equals 1, or when n and p equal 1:


Product of a Row and Column Vector:

• Row * Column (nm·mp = 1m·m1 → 11): this corresponds to the dot product as introduced above. In Matlab: A’*B, assuming A and B were both defined as column vectors.

• Column * Row (nm·mp = n1·1p → np): creates an n × p matrix, where n and p correspond to the vector lengths. Here, the elements are multiplied pairwise; no summation takes place.

The product of two matrices is then simply the application of the dot product in two loops, one iterating through the rows of the first matrix and the other iterating through the columns of the second matrix. Instead of formulating this more explicitly we give a code example, which includes several verification steps using the command assert:

clear;

a = [2 1 3 5]’; % column vector

b = [-1 2 0 3]’; % column vector

s = a’ * b % dot/scalar product

M = a * b’ % matrix product

s1 = dot(a,b); % dot product

s2 = dot(b,a); % order does not matter

assert(all(s==s1), ’something is wrong’);

assert(all(s1==s2), ’something is wrong’);

A = [a’; 4 7 8 -3];

B = [b [2 6 -2 5]’];

A*B % the matrix product

B*A % works too - even though we reversed the order! Why?

%% --- Add another row to A. Calculate the product explicitly.

A = [A; [1 1 -1 7]];

[n m1] = size(A);

[m2 p] = size(B);

assert(m1==m2, ’Dimensionality not correct’);

Mx = nan(n,p);

for i = 1:n

a = A(i,:);

for k = 1:p

b = B(:,k);

Mx(i,k) = dot(a,b);

end

end

assert(all(all(Mx==(A*B))), ’not properly programmed’);

Notes
- The loop is given merely for the purpose of illustrating the product of matrices. Of course, one would prefer to simply write A*B in the code.
- Why did B*A then work as well? [Answer: Because the size of A equals the size of B′ (transpose).]
- Observe what error you obtain when you insert the product B*A at the very end of the code again.


D.3 Appendix - Matrices wiki List of matrices

One of the most important matrices is the identity matrix:

I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}

It is a matrix with values equal to one along the diagonal entries and zero elsewhere.

Table 6: Matrices with explicitly constrained entries
• Binary Matrix: see logical matrix.
• Boolean Matrix: see logical matrix.
• Diagonal Matrix: a square matrix with all entries outside the main diagonal equal to zero.
• Identity Matrix: as introduced above.
• Logical Matrix: a matrix with all entries either 0 or 1 (synonym for binary or Boolean matrix).
• Sparse Matrix: a matrix with relatively few non-zero elements.
• Symmetric Matrix: a square matrix which is equal to its transpose, A = A^T (a_{i,j} = a_{j,i}).
• Triangular Matrix: a matrix with all entries above the main diagonal equal to zero (lower triangular) or with all entries below the main diagonal equal to zero (upper triangular).
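Most of these matrices can be created or tested directly in Matlab; a minimal sketch:

I = eye(4);            % identity matrix
D = diag([1 2 3]);     % diagonal matrix built from a vector
L = magic(4) > 8;      % logical (binary/Boolean) matrix
S = sparse(D);         % sparse storage of a matrix
U = triu(magic(4));    % upper triangular matrix (zeros below the main diagonal)
issymmetric(I)         % true: I equals its transpose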


E Reading

See the references section below for publication details.

(Theodoridis and Koutroumbas, 2008): Contains a lot of practical tips - more than any other book that also aims at both theory and practice. Treats clustering very thoroughly - in more depth than any other textbook. Contains code examples for some of the algorithms.

(Han et al., 2012): The main topic is obviously data mining and therefore clustering; it is very practice oriented. It treats the topic of data processing and preparation as well as outlier detection more elaborately than any other book, namely in separate chapters. Its treatment of clustering is shorter than in the book by Theodoridis and Koutroumbas, but explains some of the issues in a more straightforward manner - partially due to brevity. However, the book does not provide any code.

(Alpaydin, 2010): An introductory book. Reviews some topics from a different perspective than the professional, theoretical books (see below). It can be regarded as complementary to this workbook, but also complementary to other textbooks.

(Witten et al., 2011): Probably the most ’practical’ machine learning book, but rather short on the motivation of the individual classifier types. It accompanies the ’WEKA’ machine learning suite (see link above).

(James et al., 2013): A shorter but visually appealing introductory book. With some code examples for the software package R.

(Hastie et al., 2009): Appears to be the ’parent’ of the book by James et al. (2013): it is more elaborate than its ’child’ but not as exhaustive as some other books.

(Duda et al., 2001): A professional, theoretical book. The book excels at relating the different classifier philosophies and emphasizes the similarities between classifiers and neural networks. Due to its ’age’ (already 15 years since the appearance of the 2nd edition), it lacks in-depth treatment of recent advances such as combining classifiers and graph methods, for instance.

(Bishop, 2007): Another professional, theoretical book. Contains beautiful illustrations and some historic comments, but aims rather at an advanced readership (upper-level undergraduate and graduate students).

(Martinez et al., 2010): A lovely introductory book for clustering in Matlab. Suitable for those who prefer a slower pace, but clustering is not treated in depth.

Wikipedia: Always good for looking up definitions, formulations and different viewpoints. But Wikipedia’s ’variety’ - originating from the contributions of different authors - is also its shortcoming: it is hard to comprehend the topic as a whole from the individual articles (websites). Hence, textbooks are still irreplaceable.

References

Alpaydin, E. (2010). Introduction to Machine Learning. MIT Press, Cambridge, MA, 2nd edition.
Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer.
Bishop, C. (2007). Pattern Recognition and Machine Learning. Springer, New York.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. John Wiley and Sons Inc, 2nd edition.
Han, J., Kamber, M., and Pei, J. (2012). Data Mining: Concepts and Techniques. Elsevier.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, volume 112. Springer.
Martinez, W. L., Martinez, A., and Solka, J. (2010). Exploratory Data Analysis with MATLAB. CRC Press.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.


Theodoridis, S. and Koutroumbas, K. (2008). Pattern Recognition. Academic Press, 4th edition.
Witten, I., Frank, E., and Hall, M. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition.


F Code Examples

F.1 The Classifiers in One Script

The following two listings contain the classifier functions as applied in Matlab and Python. Note that in these examples there is no explicit calling of the individual functions fit and predict; the code demonstrates how to apply the ’wrapper’ functions that perform a systematic evaluation in a single line, namely the folding and the averaging across folds.

clear

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

% --- Params

nFld = 5; % number of folds

%% ----- K-Nearest Neighbors -------

MdCv = fitcknn(DAT, GrpLb, ’kfold’,nFld);

pcKnn = 1-kfoldLoss(MdCv);

fprintf(’K-Nearest Neighbour %1.3f\n’, pcKnn*100);

%% ----- Linear Discriminant -------

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld);

pcLD = 1-kfoldLoss(MdCv);

fprintf(’Linear Discriminant %1.3f\n’, pcLD*100);

%% ----- Naive Bayes -------

MdCv = fitcnb(DAT, GrpLb, ’kfold’,nFld);

pcNB = 1-kfoldLoss(MdCv);

fprintf(’Naive Bayes %1.3f\n’, pcNB*100);

%% ----- Decision Tree -------

MdCv = fitctree(DAT, GrpLb, ’kfold’,nFld);

pcTree = 1-kfoldLoss(MdCv);

fprintf(’Decision Tree %1.3f\n’, pcTree*100);

%% ----- Random Forest -------

MdCv = TreeBagger(100, DAT, GrpLb, ’OOBPred’,’on’);

pcRF = mean(1-oobError(MdCv));

fprintf(’Random Forest %1.3f\n’, pcRF*100);

%% ----- SVM + ErrorCorrectingOutputCode -------

MdCv = fitcecoc(DAT, GrpLb, ’kfold’,nFld);

pcSVM = 1-kfoldLoss(MdCv);

fprintf(’SVM + ECOC %1.3f\n’, pcSVM*100);

%% ----- SupportVectorMachine ------

nCat = 3;

IxF = crossvalind(’kfold’, GrpLb, nFld);

Pc = zeros(nFld,1);

for f = 1:nFld

Btst = IxF==f; % testing samples

Btrn = ~Btst;

nTst = nnz(Btst);

Grp.Tren = GrpLb(Btrn);

Grp.Test = GrpLb(Btst);

% ====== TRAIN MULTI-SVM =======

ASvm = cell(nCat,1);

for i = 1:nCat


Bown = Grp.Tren==i; % select class

ASvm{i} = fitcsvm(DAT(Btrn,:),Bown,’standardize’,false);

end

% ====== PREDICT =========

Post = zeros(nTst,nCat); % posteriors

for i = 1:nCat

[~,Scor] = predict(ASvm{i},DAT(Btst,:));

Post(:,i) = Scor(:,2); % 2nd column contains positive-class scores

end

[~,LbPred] = max(Post,[],2); % select highest post per sample

% ------ accuracy

pc = nnz(LbPred==Grp.Test)/nTst*100;

Pc(f) = pc;

end

fprintf(’SVM multi simple %1.3f\n’, mean(Pc));

%% ----- Ensemble AdaBoost ------

MdAda = fitensemble(DAT,GrpLb, ’AdaBoostM2’,100,’tree’,’kfold’,nFld);

pcAda = 1-kfoldLoss(MdAda,’Mode’,’Cumulative’);

fprintf(’Ensemble Boosting %1.3f\n’, pcAda(end)*100);

And now for Python. The SVM carries out the multi-class discrimination task directly; there is no need to program a multi-classifier scheme as we did in Matlab. The SKL documentation also has a one-script example comparing a number of classifiers, see page 718, section 8.4. Our example is simpler:

from sklearn import datasets, svm

from sklearn.model_selection import KFold, cross_val_score

from sklearn.neighbors import KNeighborsClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import AdaBoostClassifier

from numpy import shape, unique

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

#%% ----- K-Nearest Neighbors -------

MdKnn = KNeighborsClassifier()

Pcs = cross_val_score(MdKnn, DAT, GrpLb, cv=k_fold)

print(’K-Nearest Neighbour ’, Pcs.mean()*100)

#%% ----- Linear Discriminant -------

MdDisc = LinearDiscriminantAnalysis()

Pcs = cross_val_score(MdDisc, DAT, GrpLb, cv=k_fold)

print(’Linear Discriminant ’, Pcs.mean()*100)

#%% ----- Naive Bayes -------

MdNB = GaussianNB()

Pcs = cross_val_score(MdNB, DAT, GrpLb, cv=k_fold)

print(’Naive Bayes ’, Pcs.mean()*100)

#%% ----- Decision Tree -------

MnTree = DecisionTreeClassifier()

Pcs = cross_val_score(MnTree, DAT, GrpLb, cv=k_fold)

print(’Decision Tree ’, Pcs.mean()*100)

#%% ----- Random Forest ------- p239


MnRF = RandomForestClassifier(n_estimators=25)

Pcs = cross_val_score(MnRF, DAT, GrpLb, cv=k_fold)

print(’Random Forest ’, Pcs.mean()*100)

#%% ----- SupportVectorMachine ------

MdSvm = svm.SVC()

Pcs = cross_val_score(MdSvm, DAT, GrpLb, cv=k_fold)

print(’SupportVectorMachine ’, Pcs.mean()*100)

#%% ----- AdaBoost Classifier ------

MdAbo = AdaBoostClassifier(n_estimators=100)

Pcs = cross_val_score(MdAbo, DAT, GrpLb, cv=k_fold)

print(’AdaBoost ’, Pcs.mean()*100)


F.2 The Clustering Algorithms in One Script

Matlab offers only a few basic clustering functions, namely kmeans and clusterdata. Here we added two functions, f_DbScan and f_KmnsFuz (see the last two sections); those functions are provided in Appendix F.21 and F.20, respectively.

clear

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas; % renaming data variable

% --- Data Info:

[nSmp nFet] = size(DAT);

fprintf(’# Samples %d # Features %d \n’, nSmp, nFet);

%% ----- Kmeans -------

LbMean = kmeans(DAT,5);

ClusSz = histcounts(LbMean,5);

fprintf(’Cluster Sizes Kmeans %2d-%2d\n’, min(ClusSz), max(ClusSz));

%% ----- Kmedoids -------

LbMed = kmedoids(DAT,5);

ClusSz = histcounts(LbMed,5);

fprintf(’Cluster Sizes Kmedoids %2d-%2d\n’, min(ClusSz), max(ClusSz));

%% ----- Hierarchical -------

LbHier = clusterdata(DAT,1.1);

LbU = unique(LbHier);

nClus = length(LbU);

fprintf(’# hierarchical clusters %2d\n’, nClus);

%% ----- Density-Based SCAN -------

LbDBS = f_DbScan(DAT, 2);

LbU = unique(LbDBS);

nClus = length(LbU);

fprintf(’# dense clusters %2d\n’, nClus);

%% ----- Fuzzy Kmeans -------

LbFuz = f_KmnsFuz(DAT,5);

ClusSz = histcounts(LbFuz,5);

fprintf(’Cluster Sizes Fuzzy Kmeans%2d-%2d\n’, min(ClusSz), max(ClusSz));

Python contains a larger number of clustering implementations, but we show for the moment only the most common algorithms:

from sklearn import datasets

from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, DBSCAN

from sklearn.cluster import AgglomerativeClustering

from numpy import shape, unique

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

# --- Data Info:

nSmp,nFet = shape(DAT)

print(’# Samples ’, nSmp, ’ # Features ’, nFet)

nClus = 5 # we assume 5 clusters

#%% ----- Kmeans (Standard) ------- p748, 9.11

LbKstnd = KMeans(n_clusters=nClus).fit(DAT)

Lu,ClusSz = unique(LbKstnd.labels_,return_counts=1)

print(’Cluster Sizes Kmeans - standard ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- Kmeans (Large Data) -------

LbKbtch = MiniBatchKMeans(n_clusters=nClus).fit(DAT)

Lu,ClusSz = unique(LbKbtch.labels_,return_counts=1)

print(’Cluster Sizes Kmeans - large data’, min(ClusSz), ’-’, max(ClusSz))


#%% ----- Hierarchical ------- p734, 9.5

LbAgglo = AgglomerativeClustering(n_clusters=nClus, linkage=’ward’).fit(DAT)

Lu,ClusSz = unique(LbAgglo.labels_,return_counts=1)

print(’Hierarchical ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- DBSCAN ------ p753, s9.13

LbDBSCAN = DBSCAN(eps=0.8, min_samples=10).fit(DAT)

Lu,ClusSz = unique(LbDBSCAN.labels_,return_counts=1)

print(’DBSCAN ’, min(ClusSz), ’-’, max(ClusSz))

#%% ----- Birch ------

LbBirch = Birch(threshold=0.6, n_clusters=10).fit(DAT)

Lu,ClusSz = unique(LbBirch.labels_,return_counts=1)

print(’Birch ’, min(ClusSz), ’-’, max(ClusSz))


F.3 Prepare Your Data

Some preparatory steps to inspect and adjust your data. The following adjustments do not necessarily make sense; they are meant as example code to understand how to apply the functions.

clear % clear memory

close all % close all figures

load fisheriris % famous data set provided by Matlab

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data info

[nSmp nFet] = size(DAT); % size of data

GrpU = unique(GrpLb); % the group/class/category labels

nGrp = length(GrpU); % # of groups/classes/cats

Hgrp = histcounts(GrpLb,nGrp); % sample count per class

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

tabulate(GrpLb); % group analysis in one function

%% -----Inspect visually

figure; imagesc(DAT); colorbar(); % display as image

figure; boxplot(DAT); % box plot

figure; hist(DAT(:,1)); % plot histogram of 1st feature

figure; bar(Hgrp); title(’Sample Count per Class’);

%% ----- Introduce some artificial irregularities

DAT(1:2,1) = NaN; % set first two values to NaN

DAT(5:15,2) = NaN;

DAT(100:150,4) = NaN;

DAT(3,1) = inf; % set one value to infinity

DAT = [DAT ones(nSmp,1)*2]; % add a column of 2s

%% ----- Check for NaN, Inf, zero-standard deviation

Cnan = sum(isnan(DAT),1); % NaN count per feature

Cinf = sum(isinf(DAT),1); % inf count per feature

PropNaN = Cnan / nSmp; % proportion NaN per feature

if any(PropNaN) % plot only if there are any NaNs

figure(); bar(PropNaN);

end

Bzstd = std(DAT,[],1) < eps; % features with 0 standard deviation

if any(Bzstd),

warning(’the following feature dimensions have constant values’);

find(Bzstd)

end

%% ----- Adjust Data ------

BnoNaN = not(logical(Cnan));

DATred = DAT(:,BnoNaN); % reduced data

DAT(isinf(DAT)) = realmax; % use max to replace inf

%% ----- Standardization -----

% watch out: zscore turns any dimension containing NaNs into all NaNs
DATstd = zscore(DAT); % z-scoring: zero mean, unit standard deviation

std(DATstd,[],1) % display to verify

%% ----- Scale to Unit Range ------

DAT = bsxfun(@plus, DAT, -min(DAT,[],1)); % set minimum to 0

DAT = bsxfun(@rdivide, DAT, max(DAT,[],1)); % now we scale to 1

max(DAT,[],1) % display maxima to verify

min(DAT,[],1) % display minima to verify

%% ----- Permute Data --------

% necessary for some classifiers such as NeuralNetworks

IxPerm = randperm(nSmp); % randomize order of training samples

DAT = DAT(IxPerm,:); % reorder training set

GrpLb = GrpLb(IxPerm); % reorder group variable

%% ---- Create Folds -------


IxFld = crossvalind(’kfold’, GrpLb, 5);

from numpy import shape, unique, concatenate, ones

from numpy import nan, isnan, inf, isinf, arange, logical_not, random

from matplotlib.pyplot import bar, imshow, figure, colorbar, boxplot, hist

from sklearn import datasets

from sklearn.preprocessing import scale, minmax_scale

from sys import float_info

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Flip Data if necessary

#DAT = DAT.transpose()

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

Hgrp = hist(GrpLb, nGrp)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

#%% -----Inspect visually

figure(figsize=(4,40)); imshow(DAT); colorbar()

figure(); boxplot(DAT)

figure(); hist(DAT[:,2])

#%% ----- Introduce some irregularities

DAT[0:2,0] = nan

DAT[4:14,1] = nan

DAT[100:150,3] = nan

DAT[2,0] = inf; # set one value to infinity

DAT = concatenate((DAT, ones((nSmp,1))*2), axis=1) # add a column of 2s

#%% ----- Check for NaN, Inf, zero-standard deviation

Cnan = isnan(DAT).sum(axis=0) # NaN count per feature

Cinf = isinf(DAT).sum(axis=0) # inf count per feature

PropNaN = Cnan / nSmp # proportion NaN per feature

if PropNaN.any(): # plot only if there are any NaNs

bar(arange(0,len(PropNaN)), PropNaN)

Bzstd = DAT.std(axis=0) < .000000001 # features with 0 standard deviation

if Bzstd.any():

print(’the following feature dimensions have constant values’)

print(Bzstd.nonzero())

#%% ----- Adjust Data ------

BnoNaN = logical_not(Cnan)

DATred = DAT[:,BnoNaN]

DATred2 = DAT.compress(BnoNaN,axis=1) # excluding columns

DAT[isinf(DAT)] = float_info.max # use max to replace inf

#%% ----- Standardization -----

# we go to reduced data because ’scaler’ cannot deal with NaN

DATstd = scale(DATred) # z-scoring: zero mean, unit standard deviation

print(DATstd.std(axis=0)) # display to verify

#%% ----- Scale to Unit Range ------

DATu = minmax_scale(DATred) # scaling to E [0 1]

print(DATu.max(axis=0)) # display maxima to verify

print(DATu.min(axis=0)) # display minima to verify

#%% ----- Permute Data --------

# necessary for some classifiers such as NeuralNetworks

IxPerm = random.permutation(nSmp); # randomize order of training samples

DAT = DAT[IxPerm,:] # reorder training set

GrpLb = GrpLb[IxPerm] # reorder group variable


#%% ---- Create Folds -------

#IxFld = crossvalind(’kfold’, GrpLb, 5);
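# A minimal Python sketch (not part of the original script) of the fold labels
# that crossvalind produces in Matlab; StratifiedKFold keeps the class
# proportions roughly equal across the folds.
from sklearn.model_selection import StratifiedKFold
from numpy import zeros
IxFld = zeros(nSmp, dtype=int)
for f, (IxTrn, IxTst) in enumerate(StratifiedKFold(n_splits=5).split(DAT, GrpLb)):
    IxFld[IxTst] = f + 1    # fold labels 1..5, one label per sample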

F.3.1 Whitening Transform wiki Whitening transformation

Input: DAT, an n × d matrix. Output: DWit, the whitened data.

CovMx = cov(DAT); % covariance -> [nDim,nDim] matrix

[EPhi ELam] = eig(CovMx); % eigenvectors & -values [nDim,nDim]

Ddco = DAT * EPhi; % DECORRELATION

LamS = diag(diag(ELam).^(-0.5)); % Lambda^(-1/2), built from the eigenvalues only

DWit = Ddco * LamS; % EQUAL VARIANCE

% verify

COVdco = cov(Ddco); % covariance of decorrelated data (should be a diagonal matrix)
Df = diag(ELam)-diag(COVdco); % difference of diagonal elements
if sum(abs(Df))>0.1, error('odd: differences of diagonal elements very large!?'); end
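% Optional extra check (a sketch, not part of the original recipe): the
% whitened data should have approximately unit covariance.
COVwhi = cov(DWit);
if max(max(abs(COVwhi - eye(size(COVwhi,1))))) > 0.1
warning('whitened covariance deviates noticeably from the identity');
end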

See also http://courses.media.mit.edu/2010fall/mas622j/whiten.pdf

F.3.2 Loading and Converting Data

clear; close all;

addpath(’c:/Data/’); % adds a folder path to the variable path

%% ---- Import a Single File

DAT = importdata(’filename for data’);

Grp = importdata(’filename for class/group label’);

DAT = single(DAT); % if you do not need double precision

sfp = 'c:/DataMat/DatPrep'; % where the data will be saved to
save(sfp,'DAT','Grp'); % will be saved in compact Matlab format

%% ---- Import Multiple Files

FOLD.DatRaw = ’c:/DataRaw/’; % folder with different data files

FOLD.DatSave = ’c:/DataMat/’; % folder where we save converted data

FilesAndDir = dir(FOLD.DatRaw); % includes (’.’ and ’..’)

FileNames = FilesAndDir(3:end); % omit first two dir (’.’ and ’..’)

nFileNames = length(FileNames);

DAT = zeros(nFileNames,nDim); % nDim: # of dimensions - if known already

Grp = zeros(nFileNames,1);

for i = 1:nFileNames

fp = [FOLD.DatRaw FileNames(i).name]; % full path of the i-th data file

F = load(fp); % a feature vector

DAT(i,:) = F; % assign to DAT matrix

Grp(i) = label; % assign to group vector (label must be derived, e.g. from the file name)

end

%% --- Now Save

save(sfp, ’DAT’, ’Grp’);


F.3.3 Loading the MNIST dataset

Note that this function contains two subfunctions, ff_LoadImg and ff_ReadLab.

% Loads MNIST data and converts them from ubyte to single.

%

function [TREN LblTren TEST LblTest] = LoadMNIST()

filePath = ’C:\DatOrig\MNST\’;

Filenames = cell(4,1);

Filenames{1} = [filePath ’train-images.idx3-ubyte’];

Filenames{2} = [filePath ’train-labels.idx1-ubyte’];

Filenames{3} = [filePath ’t10k-images.idx3-ubyte’];

Filenames{4} = [filePath ’t10k-labels.idx1-ubyte’];

TREN = ff_LoadImg(Filenames{1});

LblTren = ff_ReadLab(Filenames{2});

TEST = ff_LoadImg(Filenames{3});

LblTest = ff_ReadLab(Filenames{4});

TREN = single(TREN)/255.0;

TEST = single(TEST)/255.0;

LblTren = single(LblTren);

LblTest = single(LblTest);

end % MAIN FUNCTION

%% ========== Load Digits

function IMGS = ff_LoadImg(imgFile)

fid = fopen(imgFile, ’rb’);

idf = fread(fid, 1, ’*int32’,0,’b’); % identifier

nImg = fread(fid, 1, ’*int32’,0,’b’);

nRow = fread(fid, 1, ’*int32’,0,’b’);

nCol = fread(fid, 1, ’*int32’,0,’b’);

IMGS = fread(fid, inf, ’*uint8’,0,’b’);

fclose( fid );

assert(idf==2051, ’%s is not MNIST image file.’, imgFile);

IMGS = reshape(IMGS, [nRow*nCol, nImg])’;

for i=1:nImg

Img = reshape(IMGS(i,:), [nRow nCol])’;

IMGS(i,:) = reshape(Img, [1 nRow*nCol]);

end

end % SUB FUNCTION

%% ========== Load Labels

function Lab = ff_ReadLab(labFile)

fid = fopen(labFile, ’rb’);

idf = fread(fid, 1, ’*int32’,0,’b’);

nLabs = fread(fid, 1, ’*int32’,0,’b’);

ind = fread(fid, inf, ’*uint8’,0,’b’);

fclose(fid);

assert(idf==2049, ’%s is not MNIST label file.’, labFile);

Lab = zeros(nLabs, 10);

ind = ind + 1;

for i=1:nLabs

Lab(i,ind(i)) = 1;

end

end % SUB FUNCTION


F.4 Utility Functions

F.4.1 Calculating Memory Requirements

% Calculate memory requirements for a data matrix with nEnt entries

%

% IN nEnt number of entries, typically nPoints * nDimensions

% OUT Gb number of GigaBytes

%

function Gb = f_GbSingle(nEnt)

Gb = nEnt*4/(1024^3);

fprintf(’%.3f Gb ’, Gb);

end
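A usage sketch (the numbers are only an assumption for illustration): one million samples with 500 single-precision features.

f_GbSingle(1e6 * 500); % prints approximately 1.863 Gb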


F.5 Classification Example - kNN

The code serves two purposes: to show how to apply the kNN algorithm and how to cross-validate with an increasing degree of explicitness. The first section, labeled ’All-In-One Function’, contains the classification and folding in a single line - as demonstrated already in F.1. The second section, labeled ’Folding More Explicit’, calls a separate function crossval for the folding. The third section, ’Folding Done Yourself’, shows how to be very explicit with the folding; we now make use of the functions fitxxx and predict for the individual folds. The remaining two sections explain how to move to a kNN implementation of one’s own, the first one relying on knnsearch, the second (the very last one) showing how to program it completely yourself.

clear

load fisheriris % famous data set provided by Matlab

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nFld = 5; % number of folds

nNN = 5; % number of nearest neighbors

%% ========= All-In-One-Function =============

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN, ’kfold’,nFld);

pc = 1-kfoldLoss(Mdl); % percent correct = 1-error

fprintf(’Perc correct %1.2f (all-in-one)\n’, pc*100);

%% ========= Folding More Explicit =============

Mdl = fitcknn(DAT, GrpLb, ’NumNeighbors’,nNN);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

fprintf(’Perc correct %1.2f (folding more explicit)\n’, pcf*100);

%% ========= Folding Done Yourself =============

IxFlds = crossvalind(’kfold’, GrpLb, nFld);

Pc =[];

for i = 1:nFld

% --- Prepare fold

Btst = IxFlds==i; % logical vector identifying testing samples

Btrn = ~Btst; % logical vector identifying training samples

Grp.Tren = GrpLb(Btrn); % select group labels for training
Grp.Test = GrpLb(Btst); % select group labels for testing

nTst = length(Grp.Test);

% --- Test fold

Mdl = fitcknn(DAT(Btrn,:), Grp.Tren, ’NumNeighbors’,nNN);

LbPred = predict(Mdl, DAT(Btst,:));

Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

pc = nnz(Bhit)/nTst*100;

fprintf(’Fold %d, pc %1.2f\n’, i, pc);

Pc(i) = pc;

end

fprintf(’Perc correct %1.2f (folding done yourself)\n’, mean(Pc));

%% ========= Using Matlab’s knnsearch =============

nCls = 3; % # of classes

Btst = IxFlds==2; % choosing fold 2
Btrn = ~Btst; % corresponding training samples
TRN = DAT(Btrn,:); % training data for this fold
TST = DAT(Btst,:); % testing data for this fold
nTst = nnz(Btst); % # of testing samples in this fold
Grp.Tren = GrpLb(Btrn); % group labels for training
Grp.Test = GrpLb(Btst); % group labels for testing
[IXNN Dist] = knnsearch(TRN, TST, 'k',nNN);
GNN = Grp.Tren(IXNN); % group labels of the nNN nearest neighbors
HNN = histc(GNN, 1:nCls, 2); % class histogram across the nNN neighbors
[Fq LbPred] = max(HNN, [], 2); % LbPred contains the predicted classes


Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

fprintf(’Perc correct %1.2f (knnsearch)\n’, nnz(Bhit)/nTst*100);

%% ========= Own Implementation =============

nTrn = size(TRN,1); % # of training samples

nTst = size(TST,1); % # of testing samples

GNN = zeros(nTst,11); % we will analyze 11 nearest neighbors

for i = 1:nTst

iTst = repmat(TST(i,:), nTrn, 1);% replicate to same size [nTrn nDim]

Diff = TRN-iTst; % difference [nTrn nDim]

Dist = sum(abs(Diff),2); % Manhattan distance [nTrn 1]

[~, O] = sort(Dist,’ascend’); % increasing dist for k-NN

GNN(i,:)= Grp.Tren(O(1:11)); % closest 11 samples

end

% --- Quick Knn analysis

HNN = histc(GNN(:,1:nNN), 1:nCls, 2); % class histogram for the nNN nearest neighbors
[Fq LbPred] = max(HNN, [], 2); % LbPred contains the predicted classes

Bhit = LbPred==Grp.Test; % binary vector with hits equal 1

fprintf(’Perc correct %1.2f (own)\n’, nnz(Bhit)/nTst*100);

from sklearn import datasets

from sklearn.model_selection import KFold, cross_val_score

from sklearn.neighbors import KNeighborsClassifier

from numpy import shape, unique, zeros

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

nNN = 5 # number of nearest neighbors

#%% ========= All-In-One Function =============

MdKnn = KNeighborsClassifier(nNN)

Pcs = cross_val_score(MdKnn, DAT, GrpLb, cv=k_fold)

print(’Perc correct ’, Pcs.mean()*100, ’(all-in-one)’)

#%% ========= Folding Done Yourself =============

Pc = zeros((5,1))

i = 0

for IxTrn, IxTst in k_fold.split(DAT):

#print(’Train: %s | test: %s’ % (IxTrn, IxTst))

TRN = DAT[IxTrn,:]

TST = DAT[IxTst,:]

MdKnn.fit(TRN, GrpLb[IxTrn])

LbPred = MdKnn.predict(TST)

Bhit = LbPred==GrpLb[IxTst]

pc = Bhit.sum()/len(IxTst)*100

Pc[i] = pc

i = i+1

print(’Fold ’, i, ’ pc ’, pc)


print(’Perc correct ’, Pc.mean(), ’ folding done yourself’)

F.5.1 kNN Analysis Systematic

This should run as a continuation of the above Matlab example.

%% --- Systematic Knn analysis

kNN = [3:2:11];

nNN = length(kNN); % number of NN we are testing

Pc = zeros(nNN,1); % init array

c = 0; % counter

for k = kNN

HNN = histc(GNN(:,1:k), 1:nCls, 2); % class histogram for the k nearest neighbors
[Fq LbPred] = max(HNN, [], 2); % LbPred contains the class assignments

Bhit = LbPred==Grp.Test;

c = c + 1;

Pc(c) = nnz(Bhit)/nTst*100;

end

figure(3);clf;

plot(kNN, Pc, ’*-’); title(’Perc correct for different NN’);

xlabel(’k (# of NN)’);

ylabel(’Perc Correct’);

set(gca,’ylim’,[0 100]);

F.6 Estimating the Covariance Matrix

clear;

D = randn(10,3);

[nO nDim] = size(D); % # observations/dimensions

Mn = mean(D,1); % mean

Dc = bsxfun(@minus, D, Mn); % data - mean

Cv = (Dc’ * Dc) / (nO-1); % covariance

%% ---- Verification

Vnc = var(D,[],1); % variance per dimension

Vnc-Cv(diag(true(nDim,1)))’

Cv2 = cov(D);

Cv-Cv2


F.7 Classification Example - Linear Classifier

This is the same Matlab code as in the kNN example (Appendix F.5), except that the function fitcdiscr is used instead of fitcknn. The explicit folding is not shown anymore.

clear

load fisheriris

DAT = meas; % renaming data variable

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

nFld = 5; % number of folds

%% ========= All-In-One Function =============

MdCv = fitcdiscr(DAT, GrpLb, ’kfold’,nFld);

pc = 1-kfoldLoss(MdCv); % percent correct = 1-error

fprintf(’Perc correct %1.2f (all-in-one)\n’, pc*100);

%% ========= Folding More Explicit =============

Mdl = fitcdiscr(DAT, GrpLb);

MdlF = crossval(Mdl, ’kfold’,nFld);

pcf = 1-kfoldLoss(MdlF);

fprintf(’Perc correct %1.2f (folding more explicit)\n’, pcf*100);

This is also the same Python code as in the kNN example (Appendix F.5), except that the function LinearDiscriminantAnalysis is imported instead of KNeighborsClassifier. In this example we show, however, one more time how to fold explicitly.

from sklearn import datasets

from sklearn.model_selection import KFold, cross_val_score

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from numpy import shape, unique, zeros

# --- Load the data and rename

iris = datasets.load_iris()

DAT = iris.data

GrpLb = iris.target

# --- Data Info:

nSmp,nFet = shape(DAT)

GrpU = unique(GrpLb)

nGrp = len(GrpU)

print(’# Samples ’, nSmp, ’ # Features ’, nFet, ’ # Classes ’, nGrp)

# --- Init

k_fold = KFold(n_splits=5) # prepare folds

#%% ========= All-In-One Function =============

Mdl = LinearDiscriminantAnalysis()

Pcs = cross_val_score(Mdl, DAT, GrpLb, cv=k_fold)

print(’Perc correct ’, Pcs.mean()*100, ’(all-in-one)’)

#%% ========= Folding Done Yourself =============

Pc = zeros((5,1))

i = 0

for IxTrn, IxTst in k_fold.split(DAT):

#print(’Train: %s | test: %s’ % (IxTrn, IxTst))

TRN = DAT[IxTrn,:]

TST = DAT[IxTst,:]

Mdl.fit(TRN, GrpLb[IxTrn])

LbPred = Mdl.predict(TST)

Prob = Mdl.predict_proba(TST) # returns probabilities [nSmp nFet]


Bhit = LbPred==GrpLb[IxTst]

pc = Bhit.sum()/len(IxTst)*100

Pc[i] = pc

i = i+1

print(’Fold ’, i, ’ pc ’, pc)

print(’Perc correct ’, Pc.mean(), ’ folding done yourself’)


F.8 Study Cases for PCA

clear;

%% === Example 1: increasing values, noisy

Dat1 = sort(rand(100,50)*50, 2); % increasing values

[coeff1 score1 lat1] = pca(Dat1);

figure(2); clf;

subplot(2, 2, 1); plot(Dat1’);

subplot(2, 2, 2); plot(coeff1(:, 1:3), ’linewidth’, 2); legend(’1’, ’2’, ’3’, ’location’, ’best’);

subplot(2, 2, 3); plot(lat1, ’*-’);

subplot(2, 2, 4); plot(coeff1(:, end-2:end));

%% === Example 2: straight lines with vertical offset

for i = 1:100,

Dat2(i, :) = [1:50]*rand*5;

end

[coeff2 score2 lat2] = pca(Dat2);

figure(3); clf;

subplot(2, 2, 1); plot(Dat2’);

subplot(2, 2, 2); plot(coeff2(:, 1:3), ’linewidth’, 2); legend(’1’, ’2’, ’3’, ’location’, ’best’);

subplot(2, 2, 3); plot(lat2, ’*-’);

subplot(2, 2, 4); plot(coeff2(:, end-2:end));

%% === Example 3: Sigmoids

for i = 1:100

Dat3(i, :) = normcdf([1:50], rand*2+22, 5)*(rand*0.1+1); % x, mu, sigma + rand(1, 50)/10;

end

[coeff3 score3 lat3] = pca(Dat3);

figure(4); clf;

subplot(2, 2, 1); plot(Dat3’);

subplot(2, 2, 2); plot(coeff3(:, 1:3), ’linewidth’, 2); legend(’1’, ’2’, ’3’, ’location’, ’best’);

subplot(2, 2, 3); plot(lat3, ’*-’);

subplot(2, 2, 4); plot(coeff3(:, end-2:end));

%% === Example 4: Just noise

Dat4 = randn(100, 50)*20;

[coeff4 score4 lat4] = pca(Dat4);

figure(5); clf;

subplot(2, 2, 1); plot(Dat4’);

subplot(2, 2, 2); plot(coeff4(:, 1:3), ’linewidth’, 2); legend(’1’, ’2’, ’3’, ’location’, ’best’);

subplot(2, 2, 3); plot(lat4, ’*-’);

subplot(2, 2, 4); plot(coeff4(:, end-2:end));


F.9 Example Ranking Features

% Create 10-dimensional data of which we make 6 significantly different.

% Function rankfeatures should identify those 6 significant ones.

clear; rng(’default’);

nP = 250; % # of points

nDim = 10;

% --- generate groups

Grp = round(rand(1,nP))+1;

Hg = hist(Grp,1:2);

% --- generate data

DAT = randn(nP,nDim); % normal (Gaussian) noise

Bg1 = Grp==1; % identify one group (logical vector)

Perm = randperm(nDim); % permutation

IxSig = Perm(1:6); % select 1st 6 and...

DAT(Bg1,IxSig) = DAT(Bg1,IxSig)+1; % ...make those significantly different

%% ===== Ranking ======

% Note: the function rankfeatures expects the data matrix DAT flipped!

[Otts Tts]= rankfeatures(DAT’, Bg1, ’criterion’, ’ttest’);

[Oent Ent]= rankfeatures(DAT’, Bg1, ’criterion’, ’entropy’);

[Oroc Roc]= rankfeatures(DAT’, Bg1, ’criterion’, ’roc’);

[Owcx Wcx]= rankfeatures(DAT’, Bg1, ’criterion’, ’wilcoxon’);

%% ----- Normalize & Plot

Tts = Tts / sum(Tts);

Ent = Ent / sum(Ent);

Roc = Roc / sum(Roc);

Wcx = Wcx / sum(Wcx);

figure(1);clf;

bar([Tts Ent Roc Wcx]);

legend(’t-Test’, ’Entropy’,’ROC’,’Wilcoxon’);

xlabel(’Dimension No.’);


F.10 Function k-Fold Cross-Validation

% Generates labels for a k-fold cross-validation.

% taken from Matlab’s crossvalind.

% IN Grp group variable [nSmp 1]; assumes labels in 1..nGrp
% nFld # of folds
% OUT Fld vector of fold labels [nSmp 1], values in 1..nFld
% IxF cell list with the observation indices for each fold

%

function [Fld IxF] = f_IxCrossVal(Grp, nFld)

% --- verify Group vector

Gu = unique(Grp);

nGrp = length(Gu);

assert(Gu(end)==nGrp,’Group variable not suitable: use 1,..,nGrp’);

%% ----- LOOP Groups

nSmp = length(Grp);

Fld = zeros(nSmp,1);

for g = 1:nGrp

IxGrp = find(Grp==g);

nMem = length(IxGrp); % # of members

PermMem = randperm(nMem); % permute them

IxMem = ceil(nFld*(1:nMem)/nMem); % fold indices

% and permute them to try to balance among all groups

PermFld = randperm(nFld); % permute the folds in order to balance

% randomly assign the id’s to the observations of this group

Fld(IxGrp(PermMem)) = PermFld(IxMem);

end

%% ----- List of indices

IxF = cell(nFld,1);

for i = 1:nFld

IxF{i} = find(Fld==i);

end

end % MAIN
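A usage sketch, assuming the group variable GrpLb from the earlier Fisher iris examples:

[Fld IxF] = f_IxCrossVal(GrpLb, 5); % GrpLb must contain labels 1..nGrp
tabulate(Fld) % the fold sizes should be roughly equal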


F.11 Example ROC

Calculates the ROC curve using a for-loop for instructional purposes (the next section contains a function doing the same with matrix commands only). Here, three overlapping distributions are analyzed, each one a bit more separated than the previous one, see the line Dat = [Sig-i; Bkg];, where i creates the separation.

clear all; close all;

% Generate data:

nSig = 20;

nBkg = 40;

Sig = randn(nSig,1); % signal

Bkg = randn(nBkg,1); % background

% Generate labels: 1=signal, 2=background

LbSig = ones(nSig,1);

LbBkg = ones(nBkg,1)*2;

Lb = [LbSig; LbBkg];

ntSmp = nSig+nBkg; % # of total samples

%% -------- 3 Different Degrees of Separation

Auc = [];

for i = 1:3

Dat = [Sig-i; Bkg]; % final data for this cycle

% ===== Moving threshold

[aTPR aFPR c] = deal([],[],0);

figure(1);subplot(2,2,1); cla; % clear axis

Th = unique(Dat)’; % generating thresholds

for t = Th

c = c+1; % increase counter

bLrg = Dat <= t; % decision

% ===== Evaluate

bHit = bLrg & Lb==1; % true pos / hits

bFaA = bLrg & Lb==2; % false pos / false alarms

aTPR(c) = nnz(bHit)/nSig;

aFPR(c) = nnz(bFaA)/nBkg;

% --- Plotting

figure(1);

subplot(2,2,1);

plot(Sig,zeros(nSig,1),’g.’); hold on;

plot(Bkg,zeros(nBkg,1),’r.’);

plot([t t], [0 0],’k*’);

subplot(2,2,2);

plot(aFPR,aTPR, ’b.-’); hold on;

set(gca,’xlim’, [0 1], ’ylim’, [0 1]);

% pause();

end

% Area under the curve:

aTPRmid = aTPR(1:end-1)+diff(aTPR)/2; % interpolated mid points

Auc(i) = 0.5 + abs(0.5 - aTPRmid * diff(aFPR’) );

end

%% ------ Area under the Curve Values

subplot(2,2,4);

bar(Auc);


F.11.1 ROC Function

Calculates the ROC curve using matrix commands only.

% ROC curve for a signal and noise distribution and the corresponding value

% Area-under-the-Curve.

%

% IN Dsrb distribution with signal and noise (or one cat vs another)

% Bsig logical vector with points == 1 where the signal points are

% OUT C ROC curve [nPtsUnique 2]

% auc area under the curve

%

function [C auc] = f_RocCrv(Dsrb, Bsig)

if isrow(Bsig), Bsig = Bsig’; end % flip to make it column vector

nPsig = nnz(Bsig); % # of signal points

nPnos = length(Bsig)-nPsig; % # of noise points

[~,O] = sort(Dsrb); % sort the distribution

BsigO = Bsig(O); % re-order signal labels

Tpr = cumsum(BsigO) /nPsig; % true positive rate (hits)

Fpr = cumsum(~BsigO)/nPnos; % false positive rate (false alarms)

C = [Fpr Tpr]; % the ROC curve

% --- Area under the curve:

TprMid = Tpr(1:end-1)+diff(Tpr)/2; % interpolated mid points

auc = 0.5 + abs(0.5 - TprMid’ * diff(Fpr) ); % avoid below 0.5

end
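A usage sketch with artificial data (the sizes and the offset are assumptions for illustration):

Sig = randn(200,1) - 1; % 'signal' distribution, shifted to the left
Bkg = randn(400,1); % 'background' distribution
[C auc] = f_RocCrv([Sig; Bkg], [true(200,1); false(400,1)]);
figure; plot(C(:,1), C(:,2));
xlabel('False Positive Rate'); ylabel('True Positive Rate');
title(sprintf('AUC = %1.3f', auc));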


F.12 Clustering Example - K-Means

The overview in Appendix F.2 showed how to apply the software’s function; here is an explicit implementation of the principle.

clear;

%% ---- Artificial Dataset

nP = 20;

X = [randn(nP,2)+ones(nP,2); randn(nP,2)-ones(nP,2)];

nP = size(X,1);

nCls = 2;

%% === Using kmeans

[Lb CtrMb] = kmeans(X, nCls, ’dist’,’city’, ’rep’,5, ’disp’,’final’);

% ---- Cluster info

IXC = cell(nCls,1);

for i = 1:nCls

IXC{i} = find(Lb==i);

end

%% === Implementation of the Principle

IxCtr = randsample(nP,nCls); % choose nCls samples at random

Ctr = X(IxCtr,:); % initial centroids

D = zeros(size(X));

minErr = 0.1;

mxIter = 100;

for i = 1:mxIter

% === Distances Centroid to All

for c = 1:nCls

ctr = Ctr(c,:); % one centroid [1 nDim]

Df = bsxfun(@minus, ctr, X); % difference only

D(:,c) = sum(Df.^2,2); % square suffices (we don’t root)

end

% === Find Nearest

[v IxMin] = min(D,[],2);

% === Move Centroids (new locations)
CtrOld = Ctr; % remember the previous centroids
for c = 1:nCls
Ctr(c,:) = mean(X(IxMin==c,:),1);
end
% === Stop when the centroids barely move anymore
if max(abs(Ctr(:)-CtrOld(:))) < minErr, break; end
end

% ---- Cluster info

IXC2 = cell(nCls,1);

for i = 1:nCls

IXC2{i} = find(IxMin==i);

end

%% ---- Plotting

figure(1); clf;

M = colormap;

Mr = M(randsample(64,64),:);

subplot(1,2,1); hold on;

for i = 1:nCls

plot(X(IXC{i},1),X(IXC{i},2),’.’, ’color’, Mr(i,:));

end

plot(CtrMb(:,1),CtrMb(:,2),’kx’);

plot(Ctr(:,1),Ctr(:,2),’ro’);

subplot(1,2,2); hold on;

for i = 1:nCls

plot(X(IXC2{i},1),X(IXC2{i},2),’.’, ’color’, Mr(i,:));

end

plot(CtrMb(:,1),CtrMb(:,2),’kx’);

plot(Ctr(:,1),Ctr(:,2),’ro’);


F.12.1 Cluster Information and Plotting

Function to extract cluster information from label array as returned by kmeans.

% Cluster info: centers, observation indices, member size.

% IN Cls vector with labels as produced by a clustering algorithm

% DAT observations (samples) [nObs nDim]

% minSize minimum cluster size

% strTyp info string

% OUT I .Cen centers

% .Ix indices to points

% .Sz cluster size

%

function I = f_ClsInfo(Cls, DAT, minSize, strTyp)

if nargin<3, minSize = 0; strTyp = ’’; end

if nargin<4, strTyp = ’’; end

nCls = max(Cls); % # of clusters (assuming E [1 nGrp])

nDim = size(DAT,2); % # of dimensions

H = hist(Cls, 1:nCls);

IxMinSz = find(H>=minSize);

I.n = length(IxMinSz); % # of cluster of interest

I.Cen = zeros(I.n,nDim,’single’); % centers [nCls nDim]

I.Ix = cell(I.n,1); % indices of observations

I.Sz = zeros(I.n,1,’single’); % member size (cluster cardinality)

for i = 1:I.n

bCls = Cls==IxMinSz(i); % identify the cluster indices

cen = mean(DAT(bCls,:),1); % center

I.Cen(i,:) = cen;

I.Ix{i} = single(find(bCls)); % actual obs indices

I.Sz(i) = nnz(bCls); % # of members in cluster

end

nP = size(DAT,1);

I.notUsed = nP-sum(I.Sz);

%% ---- Display

fprintf('%2d Cls %9s Sz %1d-%2d #ObsNotUsed %d of %d\n', ...

I.n, strTyp, min(I.Sz), max(I.Sz), I.notUsed, nP);

end % MAIN

Plots clusters, if two-dimensional:

% Plotting 2D clusters.

% IN I struct with center and indices as generated by f_ClsInfo

% DAT observations (samples) [nObs 2]

function [] = p_ClsSimp(I, DAT)

%% ===== All Points in Black =====

plot(DAT(:,1),DAT(:,2),’k.’); hold on;

%% ----- Init Color for Clusters

colormap(’default’); % setting default colormap (avoiding grayscale)

CM = colormap; % obtain the colormap

nCol = size(CM,1);

% permute colormap (to avoid similar colors):

Perm = randperm(nCol); % permutation

CM = CM(Perm(1:2:end),:); % take only every 2nd one

%% ===== LOOP Clusters =====

for i = 1:I.n

Ix = I.Ix{i};

col = CM(i,:);

plot(DAT(Ix,1), DAT(Ix,2), ’o’, ’color’, col, ...

’markerfacecolor’, col); % cluster members

% --- plot center on top

cen = I.Cen(i,:);

plot(cen(1), cen(2), ’x’, ’markersize’, 10);

end


end % function


F.13 Hierarchical Clustering

The overview in Appendix F.2 showed how to apply the software’s function; here is a slightly more explicit use of the software’s functions, namely we now use the three functions pdist, linkage and cluster:

clear;

nP = 20;

rng(’default’);

%% All Random

PtsRnd = rand(nP,2); % all random

%% Arc & Square Grid

degirad = pi/180;

wd = 45*degirad;

nap = 10;

yyarc = cos(linspace(-wd,wd,nap))*(0.5)+0.4;

xxarc = linspace(.15,.85,nap);

nsp = 5;

yysqu = repmat(linspace(0.1,0.3,nsp),nsp,1); yysqu = yysqu(:);

xxsqu = repmat(linspace(0.3,0.7,nsp),1,nsp);

PtsPat = [xxarc’ yyarc’];

PtsPat = [PtsPat; [xxsqu’ yysqu]]; % append

%% Clustering Random

DisRnd = pdist(PtsRnd); % pairwise distances

LnkRnd = linkage(DisRnd, ’single’);

[Ln2Rnd NConRnd] = f_LnkTrans(LnkRnd);

ClsRnd = cluster(LnkRnd, ’cutoff’, 0.29, ’criterion’, ’distance’); % 1.14);

DisLnk = sort(LnkRnd(:,3), ’descend’);

DMrnd = squareform(DisRnd);

DMrnd(diag(true(nP,1))) = inf;

[DMrndO ORnd] = sort(DMrnd,2);

NNdi = DMrndO(:,1);

[mxNN1 ixNN1mx] = max(NNdi);

%% Clustering Pattern

DisPat = pdist(PtsPat);

LnkPat = linkage(DisPat, ’single’);

[Ln2Pat NConPat] = f_LnkTrans(LnkPat);

ClsPat = cluster(LnkPat, ’cutoff’, 1.15);

ClsPat = cluster(LnkPat, ’cutoff’, 0.11, ’criterion’, ’distance’);

%% General Stats

fprintf(’#Cls Rnd %d\n’, max(ClsRnd(:)));

fprintf(’#Cls Pat %d\n’, max(ClsPat(:)));

mxl = max([LnkRnd(:,3); LnkPat(:,3)])*1.05; % y-limit

%% Plotting

[rr cc] = deal(3,2);

figure(1); clf;

subplot(rr,cc,1);

scatter(PtsRnd(:,1), PtsRnd(:,2), 100, ClsRnd, ’filled’);

set(gca, ’xlim’, [0 1]);

set(gca, ’ylim’, [0 1]);

p_MST(PtsRnd, LnkRnd, ClsRnd);

title(’Random’, ’fontweight’, ’bold’, ’fontsize’, 12);

plot(PtsRnd(ixNN1mx,1), PtsRnd(ixNN1mx,2), ’k*’);

subplot(rr,cc,2);

scatter(PtsPat(:,1), PtsPat(:,2), 100, ClsPat, ’filled’);

set(gca, ’xlim’, [0 1]);

set(gca, ’ylim’, [0 1]);

p_MST(PtsPat, LnkPat, ClsPat);

title(’Pattern’, ’fontweight’, ’bold’,’fontsize’, 12);

subplot(rr,cc,3);

[HRnd TRng] = dendrogram(LnkRnd);

set(gca,’ylim’,[0 mxl], ’fontsize’, 7);

subplot(rr,cc,4);

[HPat TPat] = dendrogram(LnkPat, 40);

set(gca,’ylim’,[0 mxl], ’fontsize’, 7);

subplot(rr,cc,5);

p_MST2(PtsRnd, Ln2Rnd, ClsRnd, ’entirenum’);

%plot(DisLnk, ’.-’);

subplot(rr,cc,6);


p_MST2(PtsPat, Ln2Pat, ClsPat, ’entire’);

F.13.1 Three Functions

Three functions for the above script now follow. The first one rearranges the linkage output; the remaining two are plotting functions.
• Linkage transform:

%TRANSZ Translate output of LINKAGE into another format.

% This is a helper function used by DENDROGRAM and COPHENET.

% In LINKAGE, when a new cluster is formed from cluster i & j, it is

% easier for the latter computation to name the newly formed cluster

% min(i,j). However, this definition makes it hard to understand

% the linkage information. We choose to give the newly formed

% cluster a cluster index M+k, where M is the number of original

% observation, and k means that this new cluster is the kth cluster

% to be formed. This helper function converts the M+k indexing into

% min(i,j) indexing.

function [Z Ncon] = f_LnkTrans(Z)

nL = size(Z,1)+1; % # of leaves

for i = 1:(nL-1)

if Z(i,1) > nL, Z(i,1) = traceback(Z,Z(i,1)); end

if Z(i,2) > nL, Z(i,2) = traceback(Z,Z(i,2)); end

if Z(i,1) > Z(i,2),Z(i,1:2) = Z(i,[2 1]); end

end

Pairs = Z(:,1:2);

Ncon = histc(Pairs(:),1:nL); % # of connections/links

%%

function a = traceback(Z,b)

nL = size(Z,1)+1; % # of leaves

if Z(b-nL,1) > nL, a = traceback(Z,Z(b-nL,1));

else a = Z(b-nL,1); end

if Z(b-nL,2) > nL, c = traceback(Z,Z(b-nL,2));

else c = Z(b-nL,2); end

a = min(a,c);

• Plotting MST, version I:

% Plots minimum spanning tree (single-link clustering) for all points

% and for the individual clusters of Cls.

%

function [] = p_MST(Pts, Lnk, Cls, type)

if ~exist(’type’, ’var’), type = ’’; end

hold on;

nPtot = size(Pts,1);

nL = size(Lnk,1);

if nPtot~=(nL+1), error('Lnk probably not correct: #Pts %d, #Lnk %d', nPtot, nL); end

if nPtot==1, pp_Singleton(Pts); return; end

Dis = Lnk(:,3); % distances

Lnk = Lnk(:,1:2); % cluster indices (ix to points and intermed clusters)

maxDist = max(Dis);

Sim = 1.1-Dis./maxDist; % similarity for linewidth

if any(Sim<eps),

warning(’linewidth < 0: %1.5f’, min(Sim));

end

%% ============ ENTIRE MST


Cen = zeros(nL,2);

LnkVec = zeros(nL,2,2,’single’);

for i = 1:nL

Ixp = Lnk(i,:); % pair indices

bLef = Ixp<=nPtot; % leaves

if all(bLef), % both are leaves (points)

Xco = Pts(Ixp,1);

Yco = Pts(Ixp,2);

elseif sum(bLef)==1 % one is a leaf (point), the other a cluster

if bLef(1), ixp = Ixp(1); ixc = Ixp(2);

else ixp = Ixp(2); ixc = Ixp(1);

end

Xco = [Pts(ixp,1); Cen(ixc-nPtot,1)];

Yco = [Pts(ixp,2); Cen(ixc-nPtot,2)];

else % both are clusters

Xco = Cen(Ixp-nPtot,1);

Yco = Cen(Ixp-nPtot,2);

end

Cen(i,:) = mean([Xco Yco],1);

LnkVec(i,:,:) = [Xco Yco];

% --- prints entire tree if desired

if strcmp(type, ’entire’)

hp = plot(Xco, Yco, ’color’, ones(1,3)*0.5, ’linestyle’, ’-’);

set(hp, ’linewidth’, Sim(i)*4);

end

end

%% ============ CLUSTER MST

if iscell(Cls),nCls = length(Cls);

else nCls = max(Cls);

end

for i = 1:nCls

if iscell(Cls), IxG = Cls{i}; % pt ixs of cluster (group)

else IxG = find(Cls==i);

end

szG = length(IxG); % group size (#Pts)

if szG==1, pp_Singleton(Pts(IxG,:)); continue; end

Brg = [];

for k = 1:szG

bOcc = Lnk==IxG(k); % find leafs in tree

IxOcc = find(sum(bOcc,2)); % indices

IxL = Lnk(IxOcc,:);

IxB = sum(IxL,2)+nPtot;

Brg = [Brg; setdiff(IxL(:),IxG(k))];

for l = IxOcc

Xco = LnkVec(l,:,1);

Yco = LnkVec(l,:,2);

hp = plot(Xco, Yco, ’color’, ’k’);

set(hp, ’linewidth’, Sim(l)*4);

end

end

B = false(nL,2);

for k = 1:length(Brg)

if Brg(k)<=nPtot, continue; end

B(Lnk==Brg(k)) = true;

end

IxB = []; % find(B(:,1)&B(:,2));

for l = IxB’

Xco = LnkVec(l,:,1);

Yco = LnkVec(l,:,2);

hp = plot(Xco, Yco, ’color’, ’g’);

set(hp, ’linewidth’, Sim(l)*4);

end

% --- connect group’s center point to remaining points

PtsSel = Pts(IxG,:);

cen = mean(PtsSel,1);

for k = 1:szG


plot([PtsSel(k,1) cen(1)], [PtsSel(k,2) cen(2)], ’color’, ones(1,3)*0.7);

end

end

%% ------------ Singleton Point

function [] = pp_Singleton(Pts)

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 5);

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 10);

• Plotting MST, version II:

% Plots minimum spanning tree (single-link clustering) for all points

% and for the individual clusters of Cls.

% sa p_MST

function [] = p_MST2(Pts, Lnk, Cls, type)

if ~exist(’type’, ’var’), type = ’’; end

hold on;

nPtot = size(Pts,1);

nL = size(Lnk,1);

if nPtot~=(nL+1), error('Lnk probably not correct: #Pts %d, #Lnk %d', nPtot, nL); end

if nPtot==1, pp_Singleton(Pts); return; end

Dis = Lnk(:,3); % distances

Lnk = Lnk(:,1:2); % cluster indices (ix to points and intermed clusters)

maxDist = max(Dis);

Sim = 1.1-Dis./maxDist; % similarity for linewidth

%% ============ ENTIRE MST

Cen = zeros(nL,2);

%LnkVec = zeros(nL,2,2,’single’);

for i = 1:nL

Ixp = Lnk(i,:); % pair indices

Xco = Pts(Ixp,1);

Yco = Pts(Ixp,2);

Cen(i,:)= mean([Xco Yco],1);

%LnkVec(i,:,:) = [Xco Yco];

% --- prints entire tree if desired

if strfind(type, ’entire’)

hp = plot(Xco, Yco, ’color’, ones(1,3)*0.5, ’linestyle’, ’-’);

set(hp, ’linewidth’, Sim(i)*4);

end

end

%% ============

if strfind(type, ’num’)

for i = 1:nPtot

Pt = double(Pts(i,:));

text(Pt(1), Pt(2), num2str(i), ’fontsize’, 8);

end

end

return

%% ------------ Singleton Point

function [] = pp_Singleton(Pts)

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 5);

plot(Pts(1), Pts(2), ’ko’, ’markersize’, 10);


F.14 Classification Example - Decision Tree

clear;

% --- Load the Data

load ionosphere

DAT = X; % renaming data variable

GrpLb = Y; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

%% ===== All-In-One Function ======

TreeCV = fitctree(DAT, GrpLb, ’kfold’,5);

pcTree = 1-kfoldLoss(TreeCV);

fprintf(’Perc correct for Tree %1.4f\n’, pcTree*100);

view(TreeCV.Trained{1},’Mode’,’graph’);

%% ===== Folding Explicitly (2 folds only) ======

Btrn = false(nSmp,1);

IxTrn = randsample(nSmp,round(0.5*nSmp)); % random training indices

Btrn(IxTrn) = true; % logical vector with training=ON

Btst = Btrn==false; % logical vector with testing=ON

Tree = fitctree(DAT(Btrn,:), GrpLb(Btrn));

LbPred = predict(Tree,DAT(Btst,:));

Bhit = LbPred==GrpLb(Btst); % binary vector with hits equal 1

cTst = nnz(Btst);

fprintf(’Perc correct %1.3f\n’, nnz(Bhit)/cTst*100);

view(Tree,’Mode’,’graph’);

%% ===== Test Different Leave Sizes ======

SzLeaf = logspace(1,2,10);

nSz = numel(SzLeaf);

Pc = zeros(nSz,1);

for i = 1:nSz

Tree = fitctree(DAT,GrpLb, ’kfold’,5, ’MinLeaf’,SzLeaf(i));

Pc(i) = 1-kfoldLoss(Tree);

end

%% ----- Plotting -----

figure(3); clf; hold on;

plot(SzLeaf,Pc*100);

plot([1 SzLeaf(end)],ones(1,2)*pcTree*100);

xlabel(’Min Leaf Size’);

ylabel(’Perc Correct’);


F.15 Classification Example - Ensemble Voting

The following Matlab example shows how to program a voting classifier explicitly; it is essentially only the maximum function that represents the crucial part - everything else is as introduced before.

clear;

% --- Load the data and rename

load fisheriris % a famous data set with 150 samples and 3 classes

DAT = meas(:,1:3); % data variable (only the first three features are used here)

GrpLb = species; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

% --- Folds

nFld = 5;

IxFlds = crossvalind(’kfold’, GrpLb, nFld);

%% --------- LOOP FOLDS ----------------

Pc = zeros(nFld,4);

for i = 1:nFld

% --- Prepare fold

Btst = IxFlds==i; % logical vector identifying testing samples

Btrn = ~Btst; % logical vector identifying training samples

Grp.Tren = GrpLb(Btrn); % select group labels for training

Grp.Test = GrpLb(Btst); % select group labels for testing

nTst = length(Grp.Test);

% --- Training Individual

MdNB = fitcnb(DAT(Btrn,:), Grp.Tren);

MdDC = fitcdiscr(DAT(Btrn,:), Grp.Tren);

MdRF = TreeBagger(10, DAT(Btrn,:), Grp.Tren);

% --- Testing Individual for Comparison

[PrdNB ScNB] = predict(MdNB, DAT(Btst,:));

[PrdDC ScDC] = predict(MdDC, DAT(Btst,:));

[PrdRF ScRF] = predict(MdRF, DAT(Btst,:));

pcNB = nnz(PrdNB==Grp.Test)/nTst*100;

pcDC = nnz(PrdDC==Grp.Test)/nTst*100;

pcRF = nnz(cellfun(@str2num,PrdRF)==Grp.Test)/nTst*100;

Pc(i,1:3) = [pcNB pcDC pcRF];

% --- Testing Ensemble

SC = cat(3,ScNB,ScDC,ScRF); % [nTst nCat nClassifiers]

ScEns = max(SC,[],3);

[~,PrdEns] = max(ScEns,[],2);

Pc(i,4) = nnz(PrdEns==Grp.Test)/nTst*100;

end

fprintf(’Perc correct %1.2f\n’, mean(Pc,1));
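The maximum rule (ScEns = max(SC,[],3)) is only one way of combining the scores. An equally common alternative, not shown in the original script, is to average the class scores across the classifiers; inside the fold loop this is a one-line change (a sketch only - note that the raw scores of different classifier types are not necessarily on a comparable scale, so averaging them is a simplification):

ScEns = mean(SC,3);           % average the class scores over the 3 classifiers
[~,PrdEns] = max(ScEns,[],2); % class with the highest average score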

The following Python code exemplifies how to use the all-in-one function.

from sklearn import datasets

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

from sklearn.naive_bayes import GaussianNB

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = datasets.load_iris()

DAT, Grp = iris.data[:, 1:3], iris.target

## ------- Individual Classifiers

CfLR = LogisticRegression(random_state=1)

CfRF = RandomForestClassifier(random_state=1)

CfNB = GaussianNB()

## ------- Ensemble Classifier

CfEns = VotingClassifier(estimators=[(’lr’, CfLR),

(’rf’, CfRF),

(’gnb’, CfNB)], voting=’hard’)


## ------- Evaluation

for Clf, Lb in zip([CfLR, CfRF, CfNB, CfEns],

[’Logistic Regression’, ’Random Forest’, ’naive Bayes’, ’Ensemble’]):

Scrs = cross_val_score(Clf, DAT, Grp, cv=5, scoring=’accuracy’)

print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (Scrs.mean(), Scrs.std(), Lb))
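As a variation that is not part of the original script, the same three classifiers can also be combined with soft voting, which averages the predicted class probabilities instead of counting hard votes (LogisticRegression, RandomForestClassifier and GaussianNB all provide predict_proba):

## ------- Ensemble Classifier, soft voting
CfEnsSoft = VotingClassifier(estimators=[('lr', CfLR),
                                         ('rf', CfRF),
                                         ('gnb', CfNB)], voting='soft')
Scrs = cross_val_score(CfEnsSoft, DAT, Grp, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [Ensemble soft]" % (Scrs.mean(), Scrs.std()))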


F.16 Classification Example - Random Forest

clear;

% --- Load the Data

load ionosphere

DAT = X; % renaming data variable

GrpLb = Y; % renaming group variable

GrpLb = grp2idx(GrpLb); % converting strings to integers

% --- Data Info:

[nSmp nFet] = size(DAT);

GrpU = unique(GrpLb);

nGrp = length(GrpU);

fprintf(’# Samples %d # Features %d # Classes %d\n’, nSmp, nFet, nGrp);

%% ===== Single Tree (for comparison) ======

Tree = fitctree(DAT, GrpLb, ’kfold’,5);

pcTree = 1-kfoldLoss(Tree);

fprintf(’Perc correct for Tree %1.4f\n’, pcTree*100);

%% ===== Random Forest ========

NWk = [1 2 5 10 20 50]; % weak classifiers

nWkt = length(NWk);

Pc = zeros(nWkt,1);

figure(1); clf; hold on;

for k = 1:nWkt

nWk = NWk(k);

Forest = TreeBagger(nWk, DAT, GrpLb, ’OOBPred’,’on’);

Pcs = 1-oobError(Forest);

Pc(k) = mean(Pcs)*100;

figure(1); plot(Pcs); pause(.2);

end

%% ===== Folding Explicitly (2 folds only) ======

Btrn = false(nSmp,1);

IxTrn = randsample(nSmp,round(0.5*nSmp)); % random training indices

Btrn(IxTrn) = true; % logical vector with training=ON

Btst = Btrn==false; % logical vector with testing=ON

Forest = TreeBagger(20, DAT(Btrn,:), GrpLb(Btrn), ’OOBPred’,’on’);

LbPredStr = predict(Forest, DAT(Btst,:)); % string labels!!

% ---- Convert string labels to numeric labels

cTst = nnz(Btst);

LbPred = cellfun(@str2num, LbPredStr); % converts string labels to scalar

% ---- Calculate prediction

Bhit = LbPred==GrpLb(Btst); % logical vector with hits equal 1

fprintf(’Perc correct %1.3f\n’, nnz(Bhit)/cTst*100);

%% ----- Plot Results

figure(2);clf;

plot(NWk,Pc); hold on;

plot([1 NWk(end)],ones(1,2)*pcTree*100);

legend(’Forest’,’Single Tree’);

xlabel(’# of Weak Learners’);

ylabel(’Perc Correct’);


F.17 Example Density Estimation

F.17.1 Histogramming and Parzen Window

The following script compares histogramming with density smoothing using Parzen windows:

clear; rng(’default’);

%% ----- An artificial Data Set

X = [randn(30,1)*5; 10+rand(60,1)*8]; % synthetic data

nP = length(X); % number of data points

%% ===== Histogramming

Edg = linspace(-15,20,35); % edges for bins

H = histcounts(X, Edg); % histogramming

%% ===== Density Estimation

[Pz Ve] = ksdensity(X, Edg(1:end-1)+0.5); % parzen window

%% ----- Plotting

figure(1); clf;

plot(X, zeros(nP,1)-.5,’k.’,’markersize’,12); hold on;

bar(Edg(1:end-1), H, ’histc’);

plot(Ve, Pz*nP, ’g.-’,’markersize’,12);

legend(’Data’, ’Histogram’, ’Kernel’, ’location’, ’northwest’);

set(gca,’ylim’, [-.95 max(H(:))]);

set(gcf,’paperposition’,[0 0 9 4]);

%% ===== Density Estimation: own implementation

bandWth = 1;

PtEv = linspace(min(X),max(X),nP); % locations of evaluation (span of the data)

PzOwn = zeros(nP,1);

for i = 1:nP

PzOwn(i) = sum(pdf('Normal', X, PtEv(i), bandWth)) / nP; % the normal pdf already contains the 1/bandWth factor

end

PzMlb = ksdensity(X,PtEv,’width’,bandWth);% for comparison

% --- Plotting

figure(2);clf;

plot(PtEv, PzOwn, ’g.’); hold on;

plot(PtEv, PzMlb, ’b’);
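For reference, the loop above computes the standard kernel density estimate with a Gaussian kernel of bandwidth h = bandWth; a sketch of the underlying formula (not taken from the original text):

\hat{p}(x) \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathcal{N}(x \mid x_i, h^2) \;=\; \frac{1}{n\,h}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2},

where x_1, ..., x_n are the data points in X.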

F.17.2 N-Dimensional Histogramming

A simple function for histogramming in multiple dimensions:

% N-dim histogram.

% IN DAT data matrix [nPts nDim]

% nEdg # of edges (# bins = nEdg-1)

% OUT H n-dimensional histogram [nEdg nEdg ...nEdg]

% aEdg list of edges for each dimension

function [H aEdg] = histcn(DAT, nEdg)

[nP nDim] = size(DAT);

IX = zeros(nP,nDim,’single’);

aEdg = cell(nDim,1);

for i = 1:nDim

Dat = DAT(:,i);

Edg = linspace(min(Dat),max(Dat),nEdg);

[~,Ix] = histc(Dat, Edg);

IX(:,i) = Ix;

aEdg{i} = single(Edg);

end

H = single(accumarray(IX,1,ones(1,nDim,’single’)*nEdg));


F.17.3 Gaussian Mixture Model

An example of how to apply the function gmdistribution:

clear

rng(’default’);

X = [randn(30,1)*5; 10+rand(60,1)*8]; % synthetic data

nP = length(X); % number of data points

EvPt = linspace(min(X),max(X),120);

Ogm = gmdistribution.fit(X,2); % we assume 2 peaks

Gm = pdf(Ogm,EvPt’); % create the estimate

[Pz Ve] = ksdensity(X, EvPt); % parzen window for comparison

%% ---- Plotting

figure(1); clf; hold on;

plot(Ve, Gm*nP,’.m’);

plot(Ve, Pz*nP, ’g’);

plot(X, zeros(nP,1)-.5,’.’);

legend(’GMM’, ’Parzen Win’, ’location’, ’northwest’);

set(gca,’ylim’, [-.8 max([Ve(:); max(Gm(:))])]);
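The fitted gmdistribution object is not only a density estimate; it can also assign each point to a mixture component, which turns the model into a (soft) clustering method. A minimal sketch reusing Ogm and X from above (cluster and posterior are methods of the gmdistribution object):

IxCmp = cluster(Ogm, X);   % hard assignment to one of the 2 components
Post  = posterior(Ogm, X); % soft assignment: [nP 2] component responsibilities
fprintf('Component sizes: %d %d\n', nnz(IxCmp==1), nnz(IxCmp==2));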


F.18 Classification Example - Naive Bayes

The overview in F.1 already gave an example of how to apply the all-in-one function in Matlab and Python. Here we give an explicit example of how one could implement a Naive Bayes model in Matlab.

clear;

rng(’default’);

S1 = [2 1.5; 1.5 3]; % covariance for multi-variate normal distribution

MuCls1 = [0.3 0.5]; % two means (mus) for class 1

MuCls2 = [3.2 0.5]; % " " " " class 2

PC1 = mvnrnd(MuCls1, S1, 50); % training class 1

TEST = mvnrnd(MuCls1, S1, 30); % testing (class 1)

PC2 = mvnrnd(MuCls2, S1, 50); % training class 2

TREN = [PC1; PC2]; % training set

Grp = [ones(size(PC1,1),1); ones(size(PC2,1),1)*2]; % group variable

%% ========== NAIVE BAYES =============

[nCat nDim] = deal(2,2);

% ===== Build class information for TRAINING set:

AVG = zeros(nCat,nDim);

[COV COVInv] = deal(zeros(nCat,nDim,nDim));

CovDet = zeros(nCat,1);

for k = 1:nCat

TrnCat = TREN(Grp==k, :); % [nCatSamp nDim]

AVG(k,:) = mean(TrnCat); % [nCat nDim]

CovCat = cov(TrnCat); % [nDim nDim]

COV(k,:,:) = CovCat; % [nCat, nDim, nDim]

CovDet(k) = det(CovCat); % determinant

COVInv(k,:,:) = pinv(CovCat); % pseudo-inverse

end

% ===== Testing a (single) sample with index ix (from TESTING set):

Prob = zeros(nCat,1); % initialize probabilities

for k = 1:nCat

detCat = abs(CovDet(k)); % retrieve class determinant

CovInv = squeeze(COVInv(k,:,:)); % retrieve class inverse

fct = 1/( ( (2*pi)^(nDim/2) )*sqrt(detCat) +eps);

Df = AVG(k,:)-TEST(1,:); % diff between avg and sample

Mah = (Df * CovInv * Df')/2; % half the squared Mahalanobis distance

Prob(k) = fct * exp(-Mah); % probability for this class

end

[mxc ixc] = max(Prob); % final decision (class winner)
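The block above classifies only a single test sample. To evaluate the whole test set, the same computation is simply wrapped in a loop; a minimal sketch reusing the variables defined above (all TEST samples were generated from class 1, so the fraction assigned to class 1 acts as the accuracy here):

nTst = size(TEST,1);
Pred = zeros(nTst,1);
for ix = 1:nTst
    Prob = zeros(nCat,1);
    for k = 1:nCat
        CovInv = squeeze(COVInv(k,:,:));                    % class inverse covariance
        fct = 1/( ((2*pi)^(nDim/2))*sqrt(abs(CovDet(k))) +eps);
        Df  = AVG(k,:)-TEST(ix,:);
        Mah = (Df * CovInv * Df')/2;
        Prob(k) = fct * exp(-Mah);
    end
    [~,Pred(ix)] = max(Prob);                               % winning class for this sample
end
fprintf('Perc assigned to class 1: %1.1f\n', nnz(Pred==1)/nTst*100);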


F.19 Classification Example - SVM

The overview in F.1 already gave two examples.
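For convenience, here is an additional minimal, self-contained sketch of a cross-validated SVM on the ionosphere data, analogous to the tree examples above; it is not part of the original overview, and the RBF kernel, automatic kernel scale and standardization are merely illustrative choices:

clear;
% --- Load the Data
load ionosphere
DAT   = X;          % renaming data variable
GrpLb = grp2idx(Y); % converting strings to integers
% --- Cross-validated SVM
MdSVM = fitcsvm(DAT, GrpLb, 'KernelFunction','rbf', 'KernelScale','auto', ...
                'Standardize',true, 'KFold',5);
pcSVM = 1-kfoldLoss(MdSVM); % fraction correct over the folds
fprintf('Perc correct for SVM %1.4f\n', pcSVM*100);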


F.20 Clustering Example - Fuzzy C-Means

% Fuzzy c-Means.

% IN DAT data matrix [nObs nDim]

% nCls k (no. of clusters)

% Opt options:

% .expU exponent for the matrix U (default: 2.0)

% .mxIter maximum number of iterations (default: 100)

% .minImprov minimum amount of improvement (default: 1e-5)

% .bDisp info display during iteration (default: true)

%

% OUT Cen cluster centers [nCls nDim]

% U membership grade matrix [nCls nObs]

% 0 = no; 1 = full membership.

% spd spread (objective function: here it is sum of distances)

%

function [Cen U spd] = f_KmnsFuz(DAT, nCls, Opt)

%% ---------- Options ------------

OptDef = struct(’expU’,2, ’mxIter’,100, ’minImprov’,1e-5, ’bDisp’,true);

if nargin==2,

Opt = OptDef;

else

% verifying exponent

assert(Opt.expU>=1, ’The exponent should be >= 1!’);

end

expo = Opt.expU; % exponent for U

mxIter = Opt.mxIter; % max iteration

minImpro = Opt.minImprov; % min improvement

bDisp = Opt.bDisp; % display progress

%% ---------- Init

[nObs nDim] = size(DAT);

Spd = zeros(mxIter,1); % array for objective function

DM = zeros(nCls, nObs, ’single’);

RepDim = ones(nDim,1,’single’);

RepObs = ones(nObs,1,’single’);

RepCls = ones(nCls,1,’single’);

% --- Init U: must sum to 1 per cluster, as required by fuzzy c-means

U = rand(nCls,nObs,’single’);

Usum = sum(U);

U = U./Usum(RepCls,:);

%% ========= LOOP ===========

for i = 1:mxIter,

Uexp = U.^expo;

Cen = Uexp*DAT./((RepDim*sum(Uexp,2)’)’); % new centers

% ===== Distance Matrix Cen-DAT ======

if nDim>1,

for k = 1:nCls

DM(k,:) = sqrt(sum((DAT-RepObs*Cen(k,:)).^2,2));

end

else % 1-D data

for k = 1:nCls

DM(k,:) = abs(Cen(k)-DAT)’;

end

end

% --- spread and new U

spd = sum(sum((DM.^2).*Uexp)); % objective function

Dpo = DM.^(-2/(expo-1)); % new U, suppose expo != 1

U = Dpo./(RepCls*sum(Dpo));

Spd(i) = spd;

if bDisp, fprintf(’%3d spread %4.1f\n’, i, spd); end


% --- break if hardly any improvement

if i>1,

if abs(spd-Spd(i-1)) < minImpro,

break;

end

end

end % for iteration

end % MAIN
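A minimal usage sketch for f_KmnsFuz (not in the original text; the synthetic data and k=2 are illustrative): the membership matrix U is turned into hard labels with a max over the cluster dimension.

clear; rng('default');
DAT = [randn(50,2); randn(50,2)+4]; % two synthetic blobs
Opt = struct('expU',2, 'mxIter',100, 'minImprov',1e-5, 'bDisp',false);
[Cen U spd] = f_KmnsFuz(DAT, 2, Opt); % fuzzy c-means with k=2
[~,Lbl] = max(U,[],1);                % hard labels from the memberships
figure(1); clf; hold on;
scatter(DAT(:,1), DAT(:,2), 30, Lbl(:), 'filled');
plot(Cen(:,1), Cen(:,2), 'kx', 'markersize', 12, 'linewidth', 2);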


F.21 Clustering Example - DBSCAN

% Fast density-based clustering with the DBSCAN algorithm.

% IN DAT data [nObs nDim]

% minPts min no. of points per cluster

% epsi max radius

% OUT Grp vector with cluster labels

% Typ vector with point type: -1=outlier; 0=border; 1=core

% epsi calculated radius if not specified as 3rd input argument

function [Grp Typ epsi] = f_DbScan(DAT, minPts, epsi)

[nObs nDim] = size(DAT);

% --- estimate eps if not present

if nargin<3 || isempty(epsi)

rp = prod(range(DAT,1));

dv = nObs*sqrt(pi.^nDim);

epsi = (rp*minPts*gamma(.5*nDim+1)/dv).^(1/nDim);

end

% --- Init

ObsNo = 1:nObs;

Typ = zeros(1,nObs,’single’); % core/border/outlier

Grp = zeros(1,nObs,’single’); % group variable

Usd = false(nObs,1); % used/visited

cGrp = 1; % group counter

%% ========== LOOP Observations =========

for i = 1:nObs

if Usd(i), continue; end

% neighborhood:

Dis = ff_Dist(i);

IxN = find(Dis<=epsi); % neighbor indices for radius = epsi

nN = length(IxN); % # neighbors

% ----exactly one neighbor (itself)

if nN==1

Typ(i) = -1; % mark as outlier

Grp(i) = -1; % mark as outlier

Usd(i) = true; % mark as used (visited)

end

% ----a few neighbors

if nN>1 && nN<minPts

Typ(i) = 0; % mark as border

Grp(i) = 0; % mark as border

end

% ----sufficient neighbors

if nN>=minPts;

Typ(i) = 1; % core

Grp(IxN) = ones(nN,1)*cGrp;

% ====== LOOP NEIGHBORS ======

while ~isempty(IxN)

ix1 = IxN(1);

Usd(ix1)= true;

IxN(1) = [];

% neighborhood II:

Dis = ff_Dist(ix1);

IxN2 = find(Dis<=epsi);

nN2 = length(IxN2);

if nN2>1

Grp(IxN2) = cGrp;

obsNo = ObsNo(ix1);

if nN2 >= minPts;

Typ(obsNo) = 1; % core

else

Typ(obsNo) = 0; % border

end

% ---- Loop Neighbors II


for j = 1:nN2

ixN2 = IxN2(j);

if Usd(ixN2), continue; end

Usd(ixN2) = true;

Grp(ixN2) = cGrp;

IxN = [IxN ixN2];

end

end

end

cGrp = cGrp+1;

end

end % for nObs

IxG0 = find(Grp==0); % identify unused

Grp(IxG0) = -1; % mark as outliers

Typ(IxG0) = -1; % mark as outliers

% ------ Distance Sample/Observation to all Others

function Dis = ff_Dist(ix)

obs = DAT(ix,:);

Df = ones(nObs,1)*obs-DAT;

Dis = sqrt(sum(Df.^2, 2))’;

if nDim==1, Dis = abs(Df)’; end

end

end
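A minimal usage sketch for f_DbScan (not part of the original text; the synthetic data and minPts=5 are illustrative choices). If the radius is not given, it is estimated and returned; points labeled -1 are outliers:

clear; rng('default');
DAT = [randn(100,2)*0.3; randn(100,2)*0.3+2.5; rand(30,2)*5-1]; % 2 blobs + noise
[Grp Typ epsi] = f_DbScan(DAT, 5); % radius epsi is estimated internally
fprintf('estimated eps %1.3f, #clusters %d\n', epsi, max(Grp));
figure(1); clf;
scatter(DAT(:,1), DAT(:,2), 30, Grp(:), 'filled'); % outliers carry label -1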


F.22 Clustering Example - Clustering Tendency

A function for measuring the presence of clusters in a dataset. The corresponding testing script is appended.

% Measuring cluster tendency with the Hopkins-Test for Randomness.

% sa ThKo p901, pdf908

%

% IN PTS observations [nPts nDim]

% OUT hop Hopkins measure:

% 0.5 random

% >0.5 clusters present

% <0.5 regularity present

%

function hop = f_RndHopkins(PTS)

[nPts nDim] = size(PTS);

%% ----- Samples

nSmp = round(nPts*0.1); % take a fraction of all samples

PRND = rand(nSmp,nDim); % generate random samples (X’)

Ornd = randperm(nPts); % generate random order

IxSmp = Ornd(1:nSmp);

PSMP = PTS(IxSmp,:); % samples from PTS (X1)

%% ----- Calculate NN distances

[~,DOW] = knnsearch(PSMP, PSMP, ’k’, 2); % delta_j

[~,Dsr] = knnsearch(PSMP, PRND); % d_j

%% ----- Hopkins

sDow = sum(DOW(:,2).^nDim); % power dimensionality and sum

sDsr = sum(Dsr.^nDim); % power dimensionality and sum

hop = sDsr / (sDow + sDsr); % Hopkins measure

fprintf(’Hopkins %1.2f (nSmp=%d)\n’, hop, nSmp);

end

And here is the corresponding testing script using artificial data:

clear;

nRep = 1000;

figure(1);clf;

%% ----- Regular Grid

[X Y] = meshgrid(0:30,0:30);

PTS = [X(:) Y(:)]./30; % regular grid

[nPts nDim] = size(PTS);

PTS = PTS(randperm(nPts),:);

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,1);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);

%% ----- Random

PTS = rand(1000,2); % random points

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,2);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);

%% ----- Cluster

PTS1 = rand(500,2); % random points

PTS2 = rand(500,2)/2+0.15; % random points dense

PTS = [PTS1; PTS2]; % cluster in lower left

[nPts nDim] = size(PTS);

PTS = PTS(randperm(nPts),:);

for i = 1:nRep, Hop(i) = f_RndHopkins(PTS); end

ts = sprintf(’Hopkins avg %1.2f +- %1.2f\n’, mean(Hop), std(Hop));

subplot(2,2,3);plot(PTS(:,1),PTS(:,2),’.’);titleb(ts);


About the Author

C. Rasche is a researcher in computer vision focusing on image classification, object detection, shape recognition, medical image analysis, etc. To solve those tasks, the study of pattern recognition and machine learning is inevitable. C. Rasche uses both traditional and modern classifiers and has often achieved very similar performances with either type; he thereby gained the experience that the development of simple, task-specific classifiers - in particular ensemble classifiers - is the quickest way to achieve fairly high performance without needing to wait for the long learning process of modern classifiers.

https://www.researchgate.net/profile/Christoph_Rasche
