THE USE OF UNLABELLED DATA FOR SUPERVISED LEARNING
A Thesis
Presented to
The Faculty of Graduate Studies
of
The University of Guelph
by
ROZITA DARA
In partial fulfilment of requirements
for the degree of
Master of Science
August, 2001
© Rozita Dara, 2001
National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
THE USE OF UNLABELLED DATA FOR SUPERVISED LEARNING
Rozita Dara University of Guelph, 2001
Advisors: Professor D. A. Stacey and Professor S. C. Kremer

When provided with enough labelled training examples, a supervised learning algorithm can learn reasonably accurately. However, creating sufficient labelled data to train accurate classifiers is time consuming and expensive. On the other hand, unlabelled data is usually easy to obtain. This research introduces a novel approach, Guelph Cluster Class (GCC), which improves the task of classification with the use of unlabelled data. The novelty of this approach lies in the use of an unsupervised network, the Self-Organizing Map, to select natural clusters in labelled and unlabelled data. Subclasses (made by labelled data) are used to assign labels to unlabelled patterns to produce self-labelled data. The performance of several variants of the GCC system has been obtained by running a Back-Propagation network on labelled and self-labelled data. Results of experiments on several benchmark datasets demonstrate an increasing power for the classification procedure even when the number of labelled data is very small.
Acknowledgements

I am indebted to my advisors:

Dr. Deborah Stacey for her support, mentorship and guidance in the course of my studies in Guelph. The extent and depth of her intelligence have never failed to inspire me. Deb's honest concern for students and her pleasant disposition made my interaction with her a truly rewarding experience.

Dr. Stefan Kremer who generously provided me with encouragement, support and creative insight. His enthusiasm for science has laid the groundwork for my future research career in science.

I would also like to thank my advisory committee member, Dr. David Calvert, for directing me throughout my research.

I am quite fortunate to have been surrounded by caring and funny friends and colleagues (Rami Zeineh, Narendra Pershad, Sudip Biswas, Orlando Cicchello, Saira Ahmad, Neil Harvey and Ramin Farshad Tabrizi) who made the time that I spent in Guelph enjoyable.

This thesis is dedicated to my husband, Shayan Sharif, who provided endless supplies of encouragement and advice. He kept me motivated through the tough times and this thesis would not be possible without his generous efforts. I am greatly thankful to him.
Contents

1 Introduction
2 Literature Review 5
  2.1 Introduction 5
  2.2 Unsupervised learning 6
  2.3 Self-Organizing Feature Map 7
    2.3.1 The Architecture 7
    2.3.2 The Learning Algorithm 9
    2.3.3 The SOM properties useful in data exploration 12
  2.4 Supervised Learning 13
  2.5 Back-propagation 15
  2.6 Previous Work 20
  2.7 Conclusion 22
3 Implementation 23
  3.1 Introduction 23
  3.2 Implementation 24
    3.2.1 Data 25
    3.2.2 Labelling Process 26
    3.2.3 Re-labelling 28
    3.2.4 Neighbouring 28
    3.2.5 Classification 30
    3.2.6 The GCC (Guelph Cluster Class) Algorithm 31
  3.3 Initial Labelled Data Selection 31
  3.4 Experiments 32
  3.5 An Example 33
  3.6 Conclusion 40
4 Experiments 42
  4.1 Introduction 42
  4.2 Data Collection and Description 42
  4.3 Experiments 47
    4.3.1 Experiment #1 48
    4.3.2 Experiment #2 55
    4.3.3 Experiment #3 57
    4.3.4 Experiment #4 61
  4.4 Conclusion 67
5 Analysis and Discussion 68
  5.1 Introduction 68
  5.2 Analysis of Experiment #1 72
    5.2.1 Analysis of Experiment #4 75
  5.3 Analysis of Experiment #2 76
  5.4 Analysis of Experiment #3 77
  5.5 Conclusion 80
6 Conclusions 82
  6.1 Future Work 84
A Experimental Results 85
  A.1 Experiments 1, 2 & 3: Labelling Procedure 85
  A.2 Experiments 1, 2 & 3: Classification stage (Results on test data) 92
  A.3 Experiment 4: Classification stage (Results on test data) 96
List of Tables

3.1 Conservative method 35
3.2 Alliance method 35
4.1 Number of Labelled, Unlabelled, and Test data 49
4.2 Size of Classes in Labelled and Unlabelled data 49
4.3 Networks Parameters 51
4.4 Confusion Matrix for Unlabelled data (E. coli) 52
4.5 Confusion Matrix for Unlabelled data (Mushroom) 53
4.6 Network Parameters for the Alliance method 56
4.7 Alliance and Conservative Results 57
4.8 Size of Labelled and Unlabelled data 64
4.9 Labelled Data Selection Results 65
5.1 Self-labelling accuracy in Alliance and Conservative methods 77
List of Figures

2.1 Unsupervised Learning 6
2.2 Self-Organizing Map 8
2.3 Neighbourhood Topologies 9
2.4 Supervised Learning 14
2.5 Back-propagation Architecture 15
3.1 Labelling Process 28
3.2 Neighbourhood Process 30
3.3 Iris Dataset Distribution 34
3.4 Self-labelling accuracy 37
3.5 Self-labelling ability 37
3.6 BP performance on Test data, IRIS dataset 39
3.7 Testing performance (BP network) on selected IRIS labelled dataset 40
4.1 E. coli data 45
4.2 Breast Cancer data 45
4.3 Mushroom data 46
4.4 Diabetes data 46
4.5 Heart Disease data 47
4.6 GCC ability improving classification of E. coli dataset 54
4.7 GCC ability improving classification of Mushroom dataset 54
4.8 Labelling ability through different techniques on the Breast-Cancer dataset 59
4.9 GCC ability on improving classification of Breast-Cancer dataset 59
4.10 Labelling ability through different techniques on the Diabetes dataset 60
4.11 GCC ability on improving classification of Diabetes dataset 60
4.12 Testing performance (BP network) on selected Heart-Disease labelled data 66
4.13 Testing performance (BP network) on selected Mushroom labelled data 66
5.1 IRIS data ordered map (training data) 69
5.2 A sample labelled data distribution 70
5.3 A sample labelled and unlabelled data distribution 70
5.4 SOM map for 15 IRIS labelled patterns 73
5.5 SOM map for 7 IRIS labelled patterns 73
5.6 Number of mis-classified data versus BP network's performance on the test data 75
5.7 Test on E. coli: relabelling accuracy versus number of relabelling processes 80
A.1 Labelling Accuracy, IRIS dataset 86
A.2 Labelling Ability, IRIS dataset 86
A.3 Labelling Accuracy, E. coli dataset 87
A.4 Labelling Ability, E. coli dataset 87
A.5 Labelling Accuracy, Breast Cancer dataset 88
A.6 Labelling Ability, Breast Cancer dataset 88
A.7 Labelling Accuracy, Mushroom dataset 89
A.8 Labelling Ability, Mushroom dataset 89
A.9 Labelling Accuracy, Diabetes dataset 90
A.10 Labelling Ability, Diabetes dataset 90
A.11 Labelling Accuracy, Heart Disease dataset 91
A.12 Labelling Ability, Heart Disease dataset 91
A.13 BP performance on the IRIS dataset 92
A.14 BP performance on the E. coli dataset 93
A.15 BP performance on the Breast Cancer dataset 93
A.16 BP performance on the Mushroom dataset 94
A.17 BP performance on the Diabetes dataset 94
A.18 BP performance on the Heart Disease dataset 95
A.19 BP performance on selected IRIS labelled data 96
A.20 BP performance on selected E. coli labelled data 96
A.21 BP performance on selected Breast Cancer labelled data 97
A.22 BP performance on selected Mushroom labelled data 97
A.23 BP performance on selected Diabetes labelled data 98
A.24 BP performance on selected Heart Disease labelled data 98
Chapter 1
Introduction
The task of classification occurs in a wide range of everyday human activities. It is applied in any context where a decision or forecast is to be made based on the currently available information, and a classification procedure is a technique for repeatedly making such decisions in new situations. As for its practical definition, classification involves the construction of a procedure that will be applied to a sequence of cases with pre-defined classes, in which each new case must be assigned to one of those classes based on previously seen examples. The construction of a classification procedure has various names, such as pattern recognition, discrimination, and supervised learning. Several approaches have been taken toward this task. The main historical branches of research are: statistics, machine learning, and neural networks.

When provided with enough labelled training samples, a variety of supervised learning algorithms can learn to be reasonably accurate classifiers. Practically, most of the classification procedures need large amounts of data for training, especially when the dimensionality of the input features or the number of classes is large. However, problems arise when there is an insufficient number of labelled training examples available. Creating a sufficient amount of labelled data is tedious and expensive, since they often have to be labelled manually. This lack of data can cause poor estimation of the parameters, which will cause inaccurate generalization on the unseen data. Another problem is unrepresentative training samples, which will raise difficulties in analyzing the data. The training samples from one region of input space for a class might not be a good representative of the samples from the same class in other regions (Shahshahani, 1994).
Similar to other classification techniques, supervised artificial neural networks (e.g. feed-forward networks) may suffer from a lack of training samples. In contrast to labelled data, unlabelled data may be generated more easily. As a result, it would be extremely useful if it were possible to use unlabelled data in the supervised learning procedure. This research and some other studies have integrated unlabelled data into the supervised learning procedure.

Theoretical aspects of the effect of labelled and unlabelled data in classification have been previously examined by several researchers (e.g. (Castelli and Cover, 1995)). In (Blum and Mitchell, 1998), they have used co-training and redundantly sufficient features in their approach. A major assumption in this technique is the existence of a natural separation of the input features into two disjoint sets in each dataset, which may not be found in all datasets. In another approach by (Nigam et al., 1998), a combination of Expectation Maximization and naive Bayes classification is used. In this method, the naive Bayes classifier is used to make an initial classifier using labelled data. Then, EM is applied to assign probabilistically weighted labels to unlabelled data. In (McCallum and Nigam, 1999), a completely unsupervised approach is used for the task of training a text classifier. They use the bootstrapping algorithm, which is a combination of EM and hierarchical shrinkage, in their approach.
This thesis introduces a novel algorithm (Guelph Cluster Class (GCC)) that improves classification using unlabelled data. Instead of training a supervised learning system with a small number of labelled data, GCC applies an unsupervised artificial neural network (Self-Organizing Map (SOM)) to assign labels to unlabelled data and obtain more labelled data. The GCC picks out natural clusters in the input data. These clusters are gathered in such a way that the topology in the output space corresponds to the topology in the input space. Then, the labelled data along with the SOM work as a classifier and assign labels to unlabelled data to provide self-labelled data. Self-labelled data will then be used to reorganize the clusters and to provide more accurate training samples for the supervised system. Ultimately, a supervised learning network, Back-propagation, is used to test the self-labelled data produced by the GCC system.
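The self-labelling step described above can be illustrated with a rough sketch. Everything here is a hypothetical illustration, not the thesis implementation: `cluster_of` stands in for any clustering that maps a pattern to a cluster id (for example, the index of a trained SOM's best-matching unit).

```python
from collections import Counter

def self_label(cluster_of, labelled, unlabelled):
    """Propagate labels cluster by cluster.

    cluster_of: maps a pattern to a cluster id (e.g. a trained SOM's
    best-matching unit index for that pattern).
    labelled: list of (pattern, label) pairs.
    unlabelled: list of patterns.
    Returns self-labelled (pattern, label) pairs for every unlabelled
    pattern that falls in a cluster containing labelled data.
    """
    # Each cluster is labelled by majority vote of its labelled members.
    votes = {}
    for x, y in labelled:
        votes.setdefault(cluster_of(x), Counter())[y] += 1
    self_labelled = []
    for x in unlabelled:
        c = cluster_of(x)
        if c in votes:  # clusters with no labelled members stay unlabelled
            self_labelled.append((x, votes[c].most_common(1)[0][0]))
    return self_labelled
```

The resulting self-labelled pairs would then be added to the labelled set used to train the Back-propagation network.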
In order to test the GCC's performance, several experiments have been performed on six benchmark datasets with different statistical characteristics and various degrees of classification difficulty. The goal of each experiment is to investigate the SOM's ability to produce more accurate self-labelled data. One of the experiments concentrates on training the SOM with different inputs: labelled data (the Conservative method) and a combination of labelled and unlabelled data (the Alliance method). In another experiment, the process of assigning labels to unlabelled data is extended to neighbouring nodes (the Neighbouring method). In addition, a re-labelling method and the degree of mis-classification of the unlabelled data are evaluated. In a completely unsupervised approach, the SOM is used to select the appropriate labelled data (active learning). Selection of labelled data is useful (i) in reducing the amount of labelled data needed for the classification procedure (which reduces the cost of labour), and (ii) in selecting the most effective labelled data.

This research shows how the GCC system can be used to obtain sufficient amounts of self-labelled data with sufficient accuracy to increase the generalization ability of a BP network. In addition, it shows how clusters produced by the SOM can be used to select the initial labelled samples. In conclusion, it can be said that self-labelled data obtained from the GCC approach have a significant effect on the improvement of a supervised learning procedure.
This thesis is partitioned as follows: Chapter 2 is an introduction to the background information in the field of neural networks, especially those techniques that are used in this thesis. In addition, it consists of an overview of the problem and the current approaches used by other research groups. Chapter 3 concentrates on the problem space and provides theoretical and practical information about the proposed approach (GCC). Furthermore, it discusses several experiments that have been carried out to explore variants of the GCC system. Chapter 4 presents the datasets and the results of the experiments. Chapter 5 contains analysis and discussion of the results presented in Chapter 4. Finally, Chapter 6 is a summary and conclusion of the previous chapters and suggested future work for this research.
Chapter 2
Literature Review
2.1 Introduction
Classification has a wide variety of applications in fields such as industry, commerce, and research. The purpose of classification is to find interesting patterns and new knowledge from databases where the dimensionality, complexity, or amount of data is significantly large and manual analysis is impossible.

The field of Artificial Neural Networks is one of the popular research approaches in pattern recognition. One of the important aspects of neural networks is the challenge of reproducing intelligence itself. This results in unique properties for neural network systems, namely generalization over unseen data and overcoming the problem of data complexity. Neural network approaches combine the complexity of some statistical techniques with the objective of simulating human intelligence; however, this is done at a more unconscious level. The majority of the work in this field can be grouped into two learning frameworks: supervised and unsupervised. The following sections contain a brief introduction to the concepts of supervised and unsupervised learning, in addition to the networks used in this thesis.
2.2 Unsupervised learning
The basic notion of unsupervised learning is that no target values are involved in the learning process. The algorithm attempts to learn the structure of the examples without a teacher defining the classes prior to the procedure. No error feedback is provided during the procedure (Figure 2.1), and the system learns to adapt based on the results that have been collected from the previous training patterns and some form of internal distance measure. The results of such a system would be a summary of some properties of the objects in the database.

Figure 2.1: Unsupervised Learning

There are several reasons for being interested in the unsupervised procedure:
* Data Reduction (Clustering): The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. One major motivation for using clustering algorithms is to provide automated tools to construct categories (clusters) and to minimize the effects of humans in the process.

* Dimensionality Reduction (Projection): Projection methods are used to reduce the dimensionality of the data items. The goal of these methods is to represent the input data items in a lower-dimensional space in such a way that important properties of the data set are preserved as much as possible.

* Data Visualization: Unsupervised learning is very useful for the visualization of high-dimensional data items. There are several methods that may be used for this purpose, such as Andrews' curves, Chernoff's faces, and five-number summaries.

* Classification: In many applications of real world data, datasets with targets are scarce. Databases are relatively large and too complicated to be classified by humans. Unsupervised methods may be used to create targets.
One of the major disadvantages of unsupervised learning is its inability to perform the classification task. In order to perform classification, human information is required to transform the output of unsupervised learning systems into classes. Recently, there has been much interest in the use of unsupervised neural computation methods. The Self-Organizing Map is one of the most popular models in this field.
2.3 Self-Organizing Feature Map

The Self-Organizing Map (SOM) was first introduced by Kohonen (Kohonen, 1997). The SOM is modeled after neurobiological structures. The SOM takes advantage of both clustering and projection methods and offers excellent visualization capabilities and techniques to compare input data items. The robust properties of the SOM make it a valuable tool in data mining.
2.3.1 The Architecture
The brain cortex is arranged as a two-dimensional plane of neurons. Each neuron is a cell containing a template that is used to match data patterns. The cells compute distances between their template and the input patterns. Cells with the closest match produce an active output. These distances will be used for representing multi-dimensional data on a two-dimensional plane of neurons. Kohonen also uses a topology similar to the brain. The output layer can be linear for the one-dimensional case, in some form of grid for the two-dimensional case (Figure 2.2), etc.

Figure 2.2: Self-Organizing Map
Kohonen's network consists of an array of nodes connected to each other based on some topology (e.g. rectangular, hexagonal) in one- or two-dimensional space (see Figure 2.3). The interconnections between the nodes only define neighbourhood relations, and no weights are assigned to them. As a result, these connections do not directly influence the learning process, in contrast to other types of neural network models. Each of the nodes forms an output unit by having a weight vector (of the same dimensionality as the input vector) assigned to it. Input and weight vectors ($\vec{x}$ and $\vec{m}_i$) are respectively denoted by

$\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{m}_i = (m_{i1}, m_{i2}, \ldots, m_{in})$

where $n$ is the dimensionality of the input vector.
Figure 2.3: Neighbourhood Topologies (rectangular and hexagonal)
Weights are initially randomly generated without considering neighbouring nodes. During the training process these weight vectors are adapted in such a way that the topology in the input space matches the topology in the output space.
2.3.2 The Learning Algorithm
The learning process begins with competition among the neurons. When an input $\vec{x}$ arrives, the neuron that is best able to represent the input wins the competition and is allowed to learn it even better. Considering the topology of the neighbouring nodes, not only the winning neuron but also its neighbours are allowed to learn. Neighbouring units will gradually adapt, during the training procedure, to respond to similar inputs. When the training is finished, similar inputs are grouped to arrange clusters which can be represented on the map. This is the essence of the SOM algorithm.

The weight vector $\vec{m}_i$ represents the typical input for each neuron $i$. The unit whose weight is nearest to the selected input $\vec{x}$ is the winner. The state of each unit with respect to input $\vec{x}(t)$ is calculated using an activation function based on the Euclidean distance between that input vector and the weight vector $\vec{m}_i(t)$ at time step $t$. Equation 2.3 describes the computation:

$\eta_i(t) = \| \vec{m}_i(t) - \vec{x}(t) \| \qquad (2.3)$

Next, the best matching unit is selected as the winner by using Equation 2.4. The unit with minimal $\eta_i(t)$ would be considered the winner:

$c(t) : \eta_c(t) = \min_i(\eta_i(t)) = \min_i \| \vec{m}_i(t) - \vec{x}(t) \| \qquad (2.4)$

where $c(t)$ is the winning unit at time $t$.
Another popular activation function is based on the multiplication of the input vector $\vec{x}$ and the weight vector $\vec{m}_i$. In this case, the activation function and the winning unit would be evaluated as follows:

$\eta_i(t) = \vec{m}_i(t) \cdot \vec{x}(t), \qquad c(t) : \eta_c(t) = \max_i(\eta_i(t)) \qquad (2.5)$
The winning unit and its neighbours change to represent the input by modifying their weight vectors. The number of units that learn and change their weight vectors depends on the neighbourhood kernel $\varphi_{ci}(t)$, which is a decreasing function of the distance between the winning unit and the other units. Weight changes also depend on a time-varying learning rate $\varepsilon(t)$. The adaptation of weight vector $\vec{m}_i(t)$ results in a new weight vector $\vec{m}_i(t+1)$, which most likely will be selected as the winning unit at a future presentation of the same input:

$\vec{m}_i(t+1) = \vec{m}_i(t) + \varphi_{ci}(t)\,\varepsilon(t)\,[\vec{x}(t) - \vec{m}_i(t)] \qquad (2.6)$

To guarantee that the learning process ends in finite time, the amount of change has to decrease gradually with time. This can be done by selecting a function called the learning rate. It starts with a relatively large number in the range [0, 1] and ends with a value close to 0. An example function might be

$\varepsilon(t) = \varepsilon(0) \cdot \exp\!\left(-\frac{t}{learn}\right)$

where $\varepsilon(0) < 1$ and $learn$ defines a parameter responsible for the reduction over time.
The neighbourhood kernel is used to describe a neighbourhood area around the winning unit $c$. Based on the WTA (winner takes all) characteristic of SOMs, not only the winning unit but also the neighbouring units (depending on the neighbourhood kernel $\varphi_{ci}(t)$) will update their weights. A popular example of a neighbourhood kernel is the Gaussian

$\varphi_{ci}(t) = \exp\!\left(-\frac{\| \vec{r}_c - \vec{r}_i \|^2}{2\,\sigma(t)^2}\right)$

where $\vec{r}_c$ and $\vec{r}_i$ are the map positions of units $c$ and $i$. $\varphi_{ci}(t)$ should decrease to its minimum at the end of the process. $\sigma(t)$ is a time-decreasing function which is defined as

$\sigma(t) = \sigma(0) \cdot \exp\!\left(-\frac{t}{neighbour}\right)$

where $\sigma(0)$ is the initial neighbourhood area, and the $neighbour$ parameter is responsible for the amount of reduction.
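The two decay schedules and the neighbourhood kernel above can be sketched numerically as follows (all parameter values are illustrative, not those used in the thesis; a 1-D map is assumed so map distance is just $|c - i|$):

```python
import math

def learning_rate(t, eps0=0.9, learn=1000.0):
    # epsilon(t) = epsilon(0) * exp(-t / learn)
    return eps0 * math.exp(-t / learn)

def sigma(t, sigma0=5.0, neighbour=1000.0):
    # sigma(t) = sigma(0) * exp(-t / neighbour)
    return sigma0 * math.exp(-t / neighbour)

def kernel(c, i, t):
    # Gaussian neighbourhood kernel over the map distance |c - i|
    s = sigma(t)
    return math.exp(-((c - i) ** 2) / (2.0 * s * s))
```

Early in training a unit a few grid steps from the winner still learns at a substantial fraction of the winner's rate; late in training the kernel is effectively zero there, so only the winner adapts.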
In summary, the algorithm is as follows:

Algorithm 1: The SOM Algorithm
1: for each unit $i$ the weight vector $\vec{m}_i$ is initially set to be random and the neighbourhood $\varphi_{ci}$ to be large.
2: one input vector $\vec{x}$ is selected from all possible inputs.
3: an activation function (Equation 2.3 or 2.5) is used to calculate the state of each unit with respect to the selected input vector.
4: the best matching unit is selected using Equation 2.4.
5: the weight vector of the winner $\vec{m}_c$, as well as the weight vectors of all units in the neighbourhood of the winner, are adapted using Equation 2.6.
6: the neighbourhood kernel is decreased, as well as the learning rate.
7: the next input is presented (Step 2).
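The seven steps above can be sketched end to end in plain Python. This is a minimal illustration only (a 1-D map, Euclidean matching, exponential decays, a Gaussian kernel, and a single decay constant `tau` shared by both schedules); none of the parameter values are taken from the thesis.

```python
import math
import random

def train_som(data, n_units=10, n_steps=500, eps0=0.5,
              sigma0=3.0, tau=200.0, seed=0):
    """Train a 1-D SOM on a list of equal-length tuples/lists."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weight vectors in [0, 1]^dim
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(n_steps):
        x = rng.choice(data)                         # Step 2: pick an input
        # Steps 3-4: squared distances to every unit; winner = smallest
        d = [sum((w[k] - x[k]) ** 2 for k in range(dim)) for w in weights]
        c = d.index(min(d))
        eps = eps0 * math.exp(-t / tau)              # learning-rate decay
        sig = max(sigma0 * math.exp(-t / tau), 0.5)  # neighbourhood decay
        # Step 5: move winner and neighbours toward the input
        for i, w in enumerate(weights):
            h = math.exp(-((i - c) ** 2) / (2.0 * sig * sig))
            for k in range(dim):
                w[k] += eps * h * (x[k] - w[k])
        # Step 6 is the decay above; Step 7 is the next loop iteration.
    return weights

def quantization_error(data, weights):
    """Mean squared distance from each pattern to its best-matching unit."""
    return sum(
        min(sum((w[k] - x[k]) ** 2 for k in range(len(x))) for w in weights)
        for x in data) / len(data)
```

Calling `train_som` with `n_steps=0` returns the untrained random map, which makes it easy to check that training reduces the quantization error.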
2.3.3 The SOM properties useful in data exploration

The learning process of the SOM gives it specific characteristics which are useful in data exploration.

Ordered display: the use of a map as a display for data items is very helpful. Items are mapped to those units that have the closest weight vector, and surrounding neighbourhoods have similar items mapped to them. Such an ordered display of the datasets can ease the understanding of the statistical structures in the datasets.

Visualization of clusters: the same ordered display could be used to demonstrate the clustering density in different parts of the dataset's space. The density of the weight vectors will reflect the density of the input samples.

Missing data: a frequently occurring problem in data exploration is missing components in data vectors. A SOM may handle this problem by using Equations 2.3 or 2.5 on the available elements in the input vector and its relative weight vector.

Outliers: outliers may result in major problems in data analysis. The map generated by the SOM algorithm may be used to detect and discard outliers from the dataset. In addition, an outlier will affect just one unit and its neighbourhood, not the rest of the training samples.

The major drawback to the use of the SOM network is its inability to perform the classification task. The output of the SOM network must be manually labelled, which is an extra cost.
2.4 Supervised Learning
In every analytical system there exist some patterns whose desired responses are known. The patterns and their desired responses are called inputs and targets, respectively. The target may be a class, in which case the task is called classification, or a continuous signal, in which case the task is called regression. The goal of supervised learning is to learn a model or mapping that will correctly relate the inputs and their targets. To achieve this goal, a teacher will help the system's learning procedure by defining the correct labels and providing the final error for the system. The final error will then be used to optimize the learning parameters (Figure 2.4).

Figure 2.4: Supervised Learning

The biggest advantage of supervised learning is its ability to generate correct outputs for input data patterns that are not part of the training set. The other properties of supervised learning are robustness to noise and the capability of handling missing elements in data patterns.

There are a few disadvantages to the use of supervised learning. Supervised learning methods are not immune from sensitivity to badly chosen initial data and parameters in the method, as well as slow learning speed. They need a large amount of data for training. In addition, providing large amounts of labelled data for the learning method is costly and sometimes impossible.

Depending on the information that the teacher carries, there are two approaches to supervised learning. One is based just on the fact that the decision is correct or wrong (reinforcement learning), and the other is based on the optimization of a training cost function where the least square error approximation plays a major role. The following section covers a brief introduction to the Back-propagation network, the supervised learning system used in this thesis.
2.5 Back-propagation
The Back-propagation network, a Multilayer Perceptron, is the most popular supervised
neural network based on the error-correcting method. Back-propagation has successfully
been used on many different problems. Given enough training data and an appropriate
architecture and initial conditions, it has been shown that BP is capable of learning the
mapping of any function to satisfactory accuracy (Haykin, 1999).
The network consists of a set of units arranged logically into input, hidden and output
layers, with no connections inside a layer. There may be more than one hidden layer.
The input of each layer is the output of the previous layer. The connections
carry weights which summarize the network's behaviour and are adjusted during training
(Figure 2.5).
Figure 2.5: Back-propagation Architecture
The operation of the network consists of two stages through the different layers, the forward
pass and the backward pass (back-propagation). In the forward pass an input pattern vector
$\bar{x}$, denoted by

    \bar{x} = (x_1, x_2, \ldots, x_n)

is presented to the network. As the input passes through the network, the activation input
to the next layer is the sum of the products of the incoming vector with its respective
weights. The general formula for the activation input to a node j is

    a_j = \sum_i w_{ji} \, out_i    (2.12)

where w_{ji} is the weight connecting node i to node j and out_i is the output from node i. For
example, for the first hidden layer this equation is

    h_j = \sum_{i=1}^{n} w_{ji} x_i

where h_j is the activation input to hidden layer unit j and n is the number of input
nodes. The output of a node from each layer j is calculated based on its activation input,

    out_j = f(a_j)    (2.14)

where f denotes the activation function of each node. A frequently used activation function
is the logistic sigmoid,

    f(a) = \frac{1}{1 + e^{-a}}
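The forward pass can be sketched in a few lines of Python. The layer sizes and weight values below are illustrative assumptions, not values from the thesis.

```python
import math

def sigmoid(a):
    # Logistic sigmoid activation: f(a) = 1 / (1 + e^(-a)).
    return 1.0 / (1.0 + math.exp(-a))

def forward_layer(inputs, weights):
    # weights[j][i] connects input node i to node j: each node takes the
    # weighted sum of its inputs and passes it through f (Eq. 2.12, 2.14).
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)))
            for row in weights]

# Tiny illustrative network: 2 inputs -> 2 hidden units -> 1 output.
x = [1.0, 0.5]
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_output = [[0.5, -0.5]]
hidden = forward_layer(x, w_hidden)
output = forward_layer(hidden, w_output)
```

Stacking `forward_layer` calls in this way is exactly how the input of each layer becomes the output of the previous one.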
Since the back-propagation network is trained based on supervised learning, each input
vector has a desired output vector which represents the classification for the input pattern.
During the backward pass, weights are adjusted according to the error (the difference between
the desired output and the actual output of the system). The weights between the output layer
and the layer below are updated by the generalized delta rule

    w_{kj}(t+1) = w_{kj}(t) + \epsilon \, \delta_k \, out_j    (2.15)

where w_{kj}(t+1) are the updated weights at time step (iteration) t+1, and \epsilon is a learning
rate parameter. The term \delta is calculated differently depending on the layer. For the
output layer nodes, \delta is calculated with respect to the rate of change of the error and the input
to node k,

    \delta_k = (d_k - out_k) \, f'(a_k)    (2.16)

where d_k is the desired output for node k. In this stage the weights connecting the output
layer and the layer below (hidden) are updated. Weights for the hidden layer, j, and below are
updated using Equation 2.15. The \delta for these layers is calculated with respect to \delta_k in the
output layer,

    \delta_j = f'(a_j) \sum_k \delta_k w_{kj}    (2.17)

The back-propagation algorithm is a gradient descent optimization procedure which mini-
mizes the mean square error between the network output and the desired output over all the
input patterns.
The algorithm is summarized as follows:
Algorithm 2 The Back-propagation Algorithm
1: Initialization: assign random values in [-1, 1] to the weights in all the layers.
2: Presentation of training data: train the network by randomly selecting an input
pattern from the training set.
3: Forward pass: compute the function signals of the network by proceeding forward,
layer by layer. Calculate the activation input to each layer using Equation 2.12. The
output signal of each unit is calculated using Equation 2.14.
4: Backward pass: compute the \delta s of the network using Equations 2.16 and 2.17. Adjust
the weights of the network using Equation 2.15.
5: Iteration: apply stages 2, 3 and 4 to the training samples until the stopping criterion
is met.
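Algorithm 2 can be sketched as a small, self-contained Python program. The one-hidden-layer architecture, the XOR toy data, the learning rate, the bias inputs and the fixed iteration budget are illustrative assumptions; the thesis's own experiments use different datasets and parameters.

```python
import math
import random

def sigmoid(a):
    # Logistic sigmoid, a frequently used activation function.
    return 1.0 / (1.0 + math.exp(-a))

class BackProp:
    # One-hidden-layer back-propagation network (a sketch of Algorithm 2).
    # Each node has an extra bias weight fed by a constant 1 input.

    def __init__(self, n_in, n_hidden, n_out, lr=0.5, seed=0):
        rng = random.Random(seed)
        # Step 1: assign random values in [-1, 1] to all the weights.
        self.w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]
        self.lr = lr

    def forward(self, x):
        # Step 3: forward pass, layer by layer (Equations 2.12 and 2.14).
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_h]
        hb = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_o]
        return xb, hb, o

    def train_pattern(self, x, target):
        # Step 4: backward pass with the generalized delta rule
        # (Equations 2.15-2.17); for the sigmoid, f'(a) = out * (1 - out).
        xb, hb, o = self.forward(x)
        d_o = [(t - ok) * ok * (1 - ok) for t, ok in zip(target, o)]
        d_h = [hb[j] * (1 - hb[j]) *
               sum(dk * self.w_o[k][j] for k, dk in enumerate(d_o))
               for j in range(len(hb) - 1)]
        for k, dk in enumerate(d_o):
            for j, hj in enumerate(hb):
                self.w_o[k][j] += self.lr * dk * hj
        for j, dj in enumerate(d_h):
            for i, xi in enumerate(xb):
                self.w_h[j][i] += self.lr * dj * xi

def mean_square_error(net, data):
    # The quantity the gradient descent minimizes, averaged over patterns.
    return sum((t[0] - net.forward(x)[2][0]) ** 2 for x, t in data) / len(data)

# Steps 2 and 5: present randomly chosen patterns until the stopping
# criterion (here simplified to a fixed iteration budget) is met.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
net = BackProp(2, 4, 1, lr=0.5, seed=1)
rng = random.Random(2)
err_before = mean_square_error(net, data)
for _ in range(20000):
    x, t = rng.choice(data)
    net.train_pattern(x, t)
err_after = mean_square_error(net, data)
```

A tolerance on the mean square error would be the more faithful stopping criterion for step 5; the fixed budget keeps the sketch short.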
Once the network has been trained, the weights are saved to be used in the classifi-
cation of unseen data (test data). The capability of processing unseen instances is called
generalization. When the generalization performance of the network on test data is much
worse than its performance on the training data, the problem is called overfitting. Over-
fitting is sometimes due to the fact that the training material does not sufficiently cover
the class space. A second reason may be a high degree of non-linearity in the training
data. In both cases, it is clear that back-propagation (similar to other supervised learning
methods) is significantly sensitive to the training dataset itself and its distribution.
Overfitting is a very well known problem. There have been several attempts to
avoid it, either by (i) changing the training data, or (ii) changing the size of the
network. (Muller et al., 1996) performed a detailed study on the generalization of multilayer
feed-forward networks. They use a function based on the number of training samples
and network parameters, testing a higher-order universal asymptotic scaling law
on the training examples to obtain a general theory relating the training curve and the number of
samples. In Equation 2.19, e_g is the generalization ability, m the number of parameters of
the network and n the number of training samples. For multilayer feed-forward networks of up
to 256 weights, they demonstrated strong overfitting for a small number of training samples
n; in this case, the generalization error was estimated as 1/n. As the number of samples
increases, the bend of the learning curve becomes close to 1/n^2.
(Baum and Haussler, 1989) discussed the same problem. Their result can be applied to
all multilayer feed-forward learning algorithms. They addressed the questions of when
a network may be expected to generalize and what the range of training samples should be,
based on the number of units and weights. The results of this study showed that the lower
and upper bounds for the number of random training samples m are of order

    m = \Omega\!\left(\frac{W}{\epsilon}\right)    (2.20)

and

    m = O\!\left(\frac{W}{\epsilon} \log_2 \frac{N}{\epsilon}\right)    (2.21)

where W is the number of weights, N is the number of nodes and 0 \le \epsilon \le 1/8. It is shown that
if m is higher than Equation 2.21, then at least a proportion 1 - \epsilon of the test examples
will be correctly classified. On the other hand, if it is lower than Equation 2.20, the network
significantly fails to classify a proportion 1 - \epsilon of the test samples.
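Treating the two bounds in their order-of-magnitude (O / Omega) form, a rough sample-size calculator can be sketched as follows; the constants are dropped, so the numbers are guides rather than exact thresholds, and the function name is ours.

```python
import math

def sample_size_bounds(num_weights, num_nodes, eps):
    # Order-of-magnitude sample-size bounds in the style of Baum and
    # Haussler (1989): Omega(W/eps) samples needed, and on the order of
    # (W/eps) * log2(N/eps) sufficient.  Constants are dropped.
    if not 0.0 < eps <= 1.0 / 8.0:
        raise ValueError("epsilon must lie in (0, 1/8]")
    lower = num_weights / eps
    upper = (num_weights / eps) * math.log2(num_nodes / eps)
    return lower, upper

# e.g. a network with 256 weights and 20 nodes, 10% error tolerance:
lo, hi = sample_size_bounds(num_weights=256, num_nodes=20, eps=0.1)
```

Even at this coarse level, the calculator makes the thesis's point concrete: thousands of labelled samples may be needed for a modest network, which motivates exploiting unlabelled data.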
2.6 Previous Work
Creating sufficient labelled training examples to learn reasonably accurate classifiers is time
consuming and expensive, since they typically must be labelled manually. This problem has
led several researchers to consider learning algorithms that do not require a large amount
of labelled data. (Castelli and Cover, 1995) provide theoretical proof that unlabelled data
may be used to improve classification. In addition, they discuss the value of labelled
data and its influence on classification error. The use of unlabelled data can be useful in
reducing the cost of the classification procedure since, unlike labelled data, unlabelled data is
easy to obtain and plentiful.
In (Blum and Mitchell, 1998), an algorithm as well as several experiments are introduced
to demonstrate how unlabelled data may improve supervised learning. This approach is
called co-training and is applicable under the following assumptions for each dataset:

each dataset is redundantly sufficient for classification

features in each dataset are separated into two disjoint sets

The key idea behind the algorithm is that it uses two independent classifiers instead
of one. Each classifier uses a different characteristic of the dataset to do the classification.
Both classifiers are trained using the labelled data. This results in two incomplete classifiers.
Then, each classifier examines the unlabelled data to pick the most confident positive and negative
examples and adds them to the labelled pool. These predictions are combined to decrease
classification error. Co-training has been tested on webpage datasets and has shown accuracy
improvements of up to 74.3%. Co-training may be a powerful method, but it is not
always applicable. A large majority of datasets have just a single feature set and, if they
have more than one, those feature sets are not independent. In (Nigam and Ghani, 2000),
the authors refer to this problem and compare results of co-training with the Expectation
Maximization (EM) algorithm. Co-training and EM performances are very close, and
both are applicable under certain assumptions that may not be met in all datasets
(e.g. EM is based on an assumption of word independence which might be violated by
text data). These deficiencies led the authors to construct an algorithm called
co-EM, a combination of co-training and EM, which results in lower errors.
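The co-training loop can be sketched as below. The two "views" correspond to the two disjoint, redundantly sufficient feature sets; the nearest-centroid base learner and the toy data are illustrative stand-ins for the naive Bayes classifiers and webpage data used by Blum and Mitchell.

```python
def centroid_classifier(examples):
    # A deliberately simple base learner: one centroid per class; it
    # stands in for the naive Bayes classifiers of the original paper.
    sums, counts = {}, {}
    for x, y in examples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        # Returns (label, confidence); confidence is negated distance.
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        label = min(cents, key=lambda y: dist(cents[y]))
        return label, -dist(cents[label])
    return predict

def co_train(labelled, unlabelled, rounds=5, per_round=1):
    # Each example is ((view1, view2), label): the two views are the
    # disjoint, redundantly sufficient feature sets co-training assumes.
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        h1 = centroid_classifier([(v1, y) for (v1, _), y in labelled])
        h2 = centroid_classifier([(v2, y) for (_, v2), y in labelled])
        # Each classifier labels the pool examples it is most confident
        # about and adds them to the shared labelled set.
        for h, view in ((h1, 0), (h2, 1)):
            scored = sorted(pool, key=lambda u: h(u[view])[1], reverse=True)
            for u in scored[:per_round]:
                labelled.append((u, h(u[view])[0]))
                pool.remove(u)
            if not pool:
                break
    return labelled

labelled = [(((0.0,), (0.0,)), "a"), (((1.0,), (1.0,)), "b")]
unlabelled = [((0.1,), (0.1,)), ((0.9,), (0.9,)),
              ((0.2,), (0.15,)), ((0.8,), (0.85,))]
result = co_train(labelled, unlabelled, rounds=4, per_round=1)
```

The shared pool is the essential design choice: each view's classifier teaches the other by growing the common labelled set.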
In (Virginia, 1994), a completely unsupervised approach was applied to the Peterson-
Barney vowel dataset. It is shown that an appropriate classifier may be
learned without having any signal or labels. The algorithm is called self-supervised. In this
algorithm, labels are assigned to codebook vectors using the k-nearest neighbour algorithm,
after the vectors have been randomly chosen. Their labels are used as the labels of the data
examples. These weights are updated throughout the process. This algorithm is applicable
to tasks in which signals for two or more modalities are available.
A different approach to the use of unlabelled data has been introduced by Dale Schuur-
mans (Schuurmans, 1997), on optimization of standard model selection (a mechanism
to balance hypothesis complexity against data fit). This approach takes advantage
of the distribution of unlabelled data in order to investigate whether the distance
between any two chosen sequences of hypotheses is violated (far from the true distance). A
suitable distance under a predefined distribution of unlabelled data is used to estimate
the true distance. This method has been tested on polynomial curve-fitting, where it shows
significant improvement compared to previous approaches.
EM is a popular technique and has been used in several studies that combine labelled
and unlabelled data. Since the naive Bayes classifier suffers from high variance when labelled
data is insufficient, in (Nigam et al., 1998) a combination of EM and this classifier is used to
overcome the problem. In this method, the naive Bayes classifier is used to build an initial
classifier from labelled data. Then, EM is applied to assign probabilistically weighted
labels to unlabelled data. EM finds a local maximum likelihood parameterization using
both labelled and unlabelled data. The experimental results of this method on real-world
datasets such as WebKB, News Groups, and ModApte demonstrate up to 33% improvement
in classification error.
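The labelled-plus-unlabelled EM scheme can be sketched with a one-dimensional Gaussian class model standing in for the naive Bayes text model; everything below (model, data, iteration count, variance floor) is an illustrative assumption.

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_with_unlabelled(labelled, unlabelled, n_iter=20):
    # Sketch of the labelled+unlabelled EM scheme: build an initial
    # classifier from the labelled data alone, then let EM refine it
    # with probabilistically weighted labels for the unlabelled points.
    classes = sorted({y for _, y in labelled})
    params = {}
    for c in classes:  # initial classifier (M-step on the hard labels)
        xs = [x for x, y in labelled if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-3
        params[c] = (mu, var, len(xs) / len(labelled))
    for _ in range(n_iter):
        # E-step: probabilistically weighted labels for unlabelled data.
        resp = []
        for x in unlabelled:
            p = {c: pr * gauss_pdf(x, mu, var)
                 for c, (mu, var, pr) in params.items()}
            z = sum(p.values()) or 1.0
            resp.append({c: p[c] / z for c in classes})
        # M-step: maximum-likelihood parameters from labelled data
        # (weight 1 on the known class) plus softly weighted unlabelled data.
        xs = [x for x, _ in labelled] + list(unlabelled)
        new = {}
        for c in classes:
            w = [1.0 if y == c else 0.0 for _, y in labelled]
            w += [r[c] for r in resp]
            tot = sum(w)
            mu = sum(wi * xi for wi, xi in zip(w, xs)) / tot
            var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, xs)) / tot
            new[c] = (mu, var + 1e-3, tot / len(xs))
        params = new
    return params

def classify(x, params):
    return max(params, key=lambda c: params[c][2] * gauss_pdf(x, *params[c][:2]))

labelled = [(0.0, "a"), (0.2, "a"), (3.0, "b"), (3.2, "b")]
unlabelled = [0.1, -0.1, 3.1, 2.9, 0.05, 3.05]
params = em_with_unlabelled(labelled, unlabelled)
```

Keeping weight 1 on the known labels while the unlabelled points carry soft weights is the essential feature of the scheme: labelled evidence is never overridden, only supplemented.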
In (McCallum and Nigam, 1999), a completely unsupervised approach is used for the task
of learning a text classifier. This technique is applicable to text datasets when there are a few
keywords per class and a class hierarchy is available. They use a bootstrapping algorithm
which is a combination of EM and hierarchical shrinkage. In their approach, keywords
are used to generate preliminary labels by term matching. Then, EM uses these labels
along with the keywords and class hierarchies to reassign probabilistically weighted class labels to
unlabelled data. Classification is further improved by using shrinkage (a statistical model
for improving parameter estimation for sparse data). Experimental results on web computer
science topics show an accuracy close to human performance.
In (Shahshahani, 1994), EM and a mixture of Gaussians are used to investigate the effect
of unlabelled data on improving feature extraction, classification and class statistics in a
remote sensing application. EM is used to find Maximum Likelihood (ML) estimates on both labelled
and unlabelled data. The starting points for the ML estimates are obtained from the training examples.
2.7 Conclusion
In order to identify the problem addressed in this thesis and its proposed solution, this
chapter provided an overview of the necessary background information: supervised
and unsupervised learning, SOM and BP networks, existing limitations in the supervised
learning procedure, and previous studies on the problem. In the following chapter, a novel
algorithm called Guelph Cluster Class is introduced. The problem (concerning the clas-
sification procedure) and its proposed solution will be discussed in detail. Furthermore, an
overview of the experiments as well as their objectives will be provided. A brief introduction
to the experimental procedures is given through an example using the IRIS dataset. The
rest of the datasets, experiments and their results are discussed in detail in Chapter 4.
Chapter 3
Implementation
3.1 Introduction
In the previous chapter, supervised and unsupervised learning were briefly introduced, as
well as their advantages and disadvantages, and background information on BP and SOM was
provided. In this chapter, the purpose of this study is discussed in more detail. The
proposed algorithm (GCC) is discussed theoretically, and an overview of the experiments
and their objectives is presented, along with an example to clarify the experiments.
An important problem in supervised learning is the effect of insufficient training samples
on classification performance. This problem is common to most classification methods and
is one of the reasons that these methods are costly or sometimes not applicable. In
practice, often only a limited number of labelled training samples can be obtained, since
they typically must be labelled manually.
Usually both the classification and feature extraction stages of an analysis are based on
the optimization of parameters that must be estimated using training samples. If the number
of labelled samples is small, both of these stages may suffer from high variance in the
parameter estimates, and the result of the whole analysis may not be satisfactory. Another
problem caused by small sample size is unrepresentative labelled data: training samples
from neighbouring regions of the data space may not be a good representation of the samples
of the same class in other regions.
The purpose of this thesis is to examine a technique which reduces the problems caused
by insufficient training data. The use of unlabelled data is a reasonable choice since it is
easier to obtain. It has been proven (in both theoretical and practical respects) that, under
certain conditions, unlabelled data carry useful information about the underlying function
((Castelli and Cover, 1995) and (Shahshahani, 1994)). The use of unlabelled data in the design
of classifiers could be useful:

to reduce the variance of the parameters, which results in better estimates

to obtain statistics that are more representative of the true distribution of the samples

to obtain prior knowledge about the distribution and statistics of the dataset, which
can further be used in the classification process.
3.2 Implementation
This thesis is an examination of the use of unlabelled data in categorization problems, as
well as of how well this technique performs on different real-world datasets.
The specific approach described in this thesis is based on a combination of two
well-known learning algorithms: the Self Organizing Map (SOM) and Back-Propagation
(BP). Since this approach deals with unlabelled data, an unsupervised learning network is
necessary to carry out the clustering task. The SOM has been selected due to its unique
properties. The SOM is capable of performing both data and dimensionality reduction at the same
time, without using prior information concerning the data distribution. In the SOM, data is ordered
into units of a map in such a way that, in this ordered map, similar data lie close to
each other. The resulting map can be used for visualization of the data and provides
information about the statistics and distribution of the data. In this approach, the SOM is
trained using a finite number of labelled data to make an initial classifier. Subsequently, the
resultant ordered map along with the initial classifier are used to assign labels to unlabelled
data, so as to provide self-labelled data. The resulting self-labelled data are then used to
reformulate the clusters and to provide training data for the classification task. BP is used
for the classification procedure and for testing the self-labelled data.
There are several assumptions underlying this technique: i) input data naturally falls
into clusters instead of being distributed across the entire data space, ii) all data points
in these clusters correspond to a specific class, iii) there is a one-to-one correspondence
between clusters and classes, and iv) labelled data and unlabelled data are from the same
distribution. It is necessary to point out that the above assumptions are common to most
approaches.
To ground the theoretical and practical aspects of this technique, and to provide a
background for the algorithm, it is necessary to define some notation.
3.2.1 Data
Consider a dataset X which represents the input data for the problem. The elements of X
are assumed to be vectors of real numbers with dimensionality n. X is denoted by

    X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}

Since this is a classification problem, each data vector is associated with a class
label. Consider L to be a finite set of labels that are assigned to the training samples
according to the function

    l : X \rightarrow L

where l is a real-world function and the goal of the system is to learn it to a satisfactory
degree.
The dataset X is partitioned into two disjoint subsets: Y, the labelled data, and Y', the unlabelled
data. For all \bar{x} \in Y, l(\bar{x}) is considered to be known, and for all \bar{x} \in Y', unknown. The input of
the GCC system can be:

just the labelled data Y; in this case the method is termed Conservative

the union of labelled and unlabelled data, Y \cup Y'; this method is called Alliance.
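The partition of X into Y and Y', and the two input modes, can be sketched as follows; the function names and the toy dataset are illustrative.

```python
import random

def partition(dataset, n_labelled, seed=0):
    # Split a fully labelled dataset X into the labelled subset Y and an
    # "unlabelled" subset Y' whose labels are set aside, as in the
    # experiments.  Names and sizes here are illustrative.
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    Y = data[:n_labelled]                        # (vector, label) pairs
    Y_prime = [x for x, _ in data[n_labelled:]]  # labels discarded
    return Y, Y_prime

def gcc_input(Y, Y_prime, method="Alliance"):
    # Conservative: only the labelled vectors are shown to the SOM.
    # Alliance: the union of labelled and unlabelled vectors is used.
    vectors = [x for x, _ in Y]
    if method == "Alliance":
        vectors += list(Y_prime)
    return vectors

X = [((i / 10.0,), "a" if i < 5 else "b") for i in range(10)]
Y, Y_prime = partition(X, n_labelled=4, seed=1)
```

Note that only the vectors, never the set-aside labels, reach the SOM in either mode.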
3.2.2 Labelling Process
As already noted, the SOM is used to assign labels to elements of the unlabelled subset Y'.
The term labelling process refers to the process of assigning labels to SOM nodes, which
results in assigning labels to unlabelled vectors as well. This process is done as follows. The
SOM consists of a set of nodes S arranged in a two-dimensional space

    S = \{ s = (i, j) \}

The SOM is trained using an input dataset (from either the Alliance or the Conservative method).
During the training process, the SOM orders the map in such a way that the topology
in the output space corresponds to the topology in the input space. The training process
is completed by mapping input vectors to the nodes. This process is represented by the
clustering function f_c, denoted by

    f_c : X \rightarrow S

If f_c^{-1} is the inverse of the clustering function f_c, then f_c^{-1}(s) is the set of
input vectors associated with a particular SOM node s.
If the node to which an unlabelled vector has been clustered is assigned a label, then
the unlabelled vector may be assigned the same label under certain conditions. Unlabelled
data can be clustered to a node s where:

all the labelled vectors that have been assigned to that node have the same label, l_s.
In this case, the SOM node and all the previously seen unlabelled vectors in node s
will be assigned the label l_s (Figure 3.1), where l^{(1)} denotes the resulting labelling
function for \bar{x} \in Y'. This process is referred to as first-order labelling.

all the labelled vectors assigned to node s do not have the same label. In that
case, the label for that node cannot clearly be identified. This node is referred to
as a non-labelling node (Figure 3.1) and the labels for the unlabelled vectors clustered
to that node cannot be identified.

no labelled vectors have been clustered to node s. The node is then referred to as
an undefined node (Figure 3.1) and the label of the unlabelled vectors will remain
unknown.

The term ambiguous refers to nodes that are either non-labelling or undefined.
In summary, Y'^{(1)}, the subset of Y' elements that have been clustered to non-ambiguous
nodes, can be represented by

    Y'^{(1)} = \{ \bar{x} \in Y' : f_c(\bar{x}) \text{ is a labelling node} \}
Figure 3.1: Labelling Process
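First-order labelling can be sketched as below, with a simple stand-in for the trained SOM's clustering function f_c; the 1-D node map and the data are illustrative assumptions.

```python
def first_order_labelling(node_of, labelled, unlabelled):
    # Sketch of the labelling process.  node_of(x) plays the role of the
    # SOM clustering function f_c, mapping a vector to its best node.
    seen = {}
    for x, y in labelled:
        seen.setdefault(node_of(x), set()).add(y)
    node_labels = {}
    for s, ys in seen.items():
        # Labelling node if all its labelled vectors agree; otherwise a
        # non-labelling (ambiguous) node, marked None.
        node_labels[s] = next(iter(ys)) if len(ys) == 1 else None
    self_labelled, remaining = [], []
    for x in unlabelled:
        s = node_of(x)
        # Undefined nodes (absent from node_labels) and non-labelling
        # nodes both leave the vector unlabelled for the next stage.
        if node_labels.get(s) is not None:
            self_labelled.append((x, node_labels[s]))
        else:
            remaining.append(x)
    return node_labels, self_labelled, remaining

def node_of(x):
    # Toy 1-D stand-in for f_c: nodes indexed by rounding.
    return round(x)

labelled = [(0.1, "a"), (0.2, "a"), (2.1, "b"), (2.9, "b"), (3.1, "a")]
node_labels, self_labelled, remaining = first_order_labelling(
    node_of, labelled, [0.3, 1.6, 2.2, 4.4])
```

Here node 3 receives both "a" and "b" labelled vectors, so it becomes non-labelling, while node 4 is never seen and stays undefined; both kinds pass their vectors on unlabelled.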
Second-order labelling refers to a process in which the SOM is retrained with the original
labelled data in addition to the new self-labelled data. These self-labelled vectors can be
used to reorganize the SOM clusters in the same way the original labelled data were used.
This process can be extended to an arbitrary depth. However, the re-labelling process
is typically terminated at the point where no major change can be seen in the amount of
remaining unlabelled data (the point of diminishing returns).
3.2.4 Neighbouring
As was previously noted, the SOM has the unique characteristic of assigning similar train-
ing patterns to neighbouring nodes. The shape of the neighbourhood depends on the SOM
topology (e.g. rectangular, hexagonal). This results in a special relationship between neigh-
bouring nodes that can be used to extend the ability of the SOM to provide more labels for
unlabelled vectors. These labels are assigned to nodes that have previously been
considered undefined. Consider S' \subset S, where the elements of S' are assumed to be the neighbours
of the node s = (i, j).
The possibility of assigning labels to undefined nodes can be investigated by examining the
neighbouring nodes. An undefined node s could be located where:

all the neighbouring nodes which have a label assigned to them have the same
label, l_s. Then the undefined node is assigned that label (Figure 3.2). As a
result, any unlabelled vectors clustered to this node will be assigned the same
label:

    \forall \bar{x} \in Y \;\wedge\; \forall s' \in S' : f_c(\bar{x}) = s' \Rightarrow l(\bar{x}) = l_s    (3.8)

multiple labels are assigned to the neighbouring nodes (Figure 3.2). No label will be
assigned to the node (it remains ambiguous).

all the neighbouring nodes are undefined. In this case, no label is assigned to the node.

The neighbouring procedure can be repeated iteratively until there are no undefined
nodes in the neighbourhood of the labelled and newly added self-labelled data.
Figure 3.2: Neighbourhood Process
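The neighbouring procedure on a rectangular map can be sketched as follows; the 3x3 toy map and the 8-connected neighbourhood are illustrative choices.

```python
def neighbours(s, width, height):
    # Rectangular (8-connected) neighbourhood of node s = (i, j).
    i, j = s
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < width and 0 <= j + dj < height]

def neighbouring(node_labels, width, height):
    # Sketch of the neighbouring procedure: an undefined node whose
    # labelled neighbours all carry one label inherits it; nodes with
    # conflicting or no labelled neighbours are left alone.  Passes
    # repeat until nothing changes.  node_labels maps already-decided
    # nodes to a label, or to None for non-labelling nodes.
    labels = dict(node_labels)
    changed = True
    while changed:
        changed = False
        for i in range(width):
            for j in range(height):
                if (i, j) in labels:
                    continue  # already labelled or non-labelling
                near = {labels[s] for s in neighbours((i, j), width, height)
                        if labels.get(s) is not None}
                if len(near) == 1:
                    labels[(i, j)] = near.pop()
                    changed = True
    return labels

# All nodes of a 3x3 map are undefined except two labelled corners.
filled = neighbouring({(0, 0): "a", (2, 2): "a"}, 3, 3)
```

With agreeing corners the whole map is eventually labelled; with conflicting corners, the nodes that see both labels stay ambiguous, which is exactly the second case in the list above.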
3.2.5 Classification
Similar to any other classification technique, BP suffers from poor estimation and unrep-
resentative training samples. These problems have been addressed in (Baum and Haussler,
1989) and (Muller et al., 1996). The objective of the studies presented in this thesis is to
overcome the difficulties in the supervised learning procedure (BP as an example) caused
by insufficient training samples. Several methods of deriving labelled data from unlabelled
data have been introduced in the previous sections. After labelling unlabelled data
to a desired degree, d, the validity of the approaches is tested by evaluating the
performance of BP trained on the original labelled patterns, as well as on the newly added
self-labelled data. BP is trained on input vectors which are a subset of X. The elements of this
subset are all labelled; in particular, the training set is Y'^{(d)} \cup Y (where d is the desired degree of
re-labelling and Y'^{(d)} is the self-labelled data), instead of simply Y as in the conventional
approach.
3.2.6 The GCC (Guelph Cluster Class) Algorithm
This approach is termed "Guelph Cluster Class (GCC)", a name originally coined by
Stacey, Kremer, and Dara. Having provided the required background information in the previous sections,
the GCC algorithm is summarized as follows:
Algorithm 3 The GCC Algorithm
1: train the SOM using the Conservative or Alliance method
2: apply first-order labelling to obtain self-labelled data
3: add the self-labelled data to the original labelled data
4: apply second-order labelling and the neighbouring method to obtain more self-labelled data
5: repeat step 4 until no major change can be seen in the remaining unlabelled data
6: train and test the supervised learning network with the original labelled data and the self-
labelled data
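The labelling loop of Algorithm 3 (steps 2-5) can be sketched as below. The SOM training of step 1, the SOM retraining implied by second-order labelling, and the BP stage of step 6 are abstracted away behind a fixed clustering function, so this is a structural sketch only; all names and data are illustrative.

```python
def gcc(cluster, labelled, unlabelled, max_rounds=10):
    # High-level sketch of the GCC labelling loop (steps 2-5 of
    # Algorithm 3).  'cluster' stands in for the trained SOM: it maps a
    # vector to a node identifier.
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        # A node is a labelling node if all labelled vectors clustered
        # to it agree on a single label (first-order labelling).
        by_node = {}
        for x, y in labelled:
            by_node.setdefault(cluster(x), set()).add(y)
        node_label = {s: next(iter(ys))
                      for s, ys in by_node.items() if len(ys) == 1}
        newly, rest = [], []
        for x in pool:
            (newly if cluster(x) in node_label else rest).append(x)
        if not newly:  # point of diminishing returns: stop (step 5)
            break
        # Step 3: add the self-labelled data to the labelled set.
        labelled += [(x, node_label[cluster(x)]) for x in newly]
        pool = rest
    return labelled, pool

def cluster(x):
    # Toy stand-in for the SOM clustering function f_c.
    return round(x)

Y = [(0.1, "a"), (3.1, "b")]    # labelled
Yp = [0.2, 0.9, 1.2, 2.8, 3.3]  # unlabelled
final_labelled, still_unlabelled = gcc(cluster, Y, Yp)
```

Because the stand-in clustering function never changes, later rounds add nothing here; in the full system the SOM is retrained with the grown labelled set, which is what makes the later rounds productive.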
3.3 Initial Labelled Data Selection
Most studies that apply unlabelled data to improve supervised learning use labelled
data in their approaches. Labelling is a costly process, and it is very important to find a
method to select the smallest and most effective set of data patterns as labelled data (active
learning). Unlabelled data may provide useful prior knowledge about the distribution and
statistics of the data, which can be used to select labelled patterns. In this study, the
information from the clustering stage of the SOM is used to select initial labelled data with
the highest quality and lowest quantity. When training of the SOM network is complete,
the nodes that had unlabelled patterns assigned to them can be used for data selection by one
of the following strategies:

Method 1: select one or more data patterns from every node.

Method 2: vary the number of selected items with the density of data patterns in each
region. If the density in a region is high, a smaller number of patterns is selected
from that region; on the other hand, larger numbers of data patterns are selected
from low-density regions.

Method 3: select one or more patterns from randomly selected nodes. In this case, the number of
selected data patterns must be specified. All the neighbours around a selected node
are blocked, to make sure no more data are selected from that neighbourhood.

Each strategy has advantages and disadvantages that are important to consider during
the experiments. The labelled data resulting from Method 1 is sufficient for the classification pro-
cedure without using the GCC system. In this method, patterns are selected from all the
nodes in the input space. The resulting labelled set has the same characteristics as
the original dataset, and classification performance may be at its highest level. A major
drawback is the large number of labelled data that are selected using this strategy. In the
second strategy, not only is the size of the selected labelled data much smaller, but the
dataset is also highly effective when using the GCC system. This approach is not effective when
the data is scattered all over the input space. Method 3 has the advantage of reducing the
number of selected labelled data to the desired degree. Because of the random selection in this
technique, the selected labelled data may not be effective.
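Method 3 with neighbour blocking can be sketched as follows; the node-to-pattern assignment and the grid are illustrative.

```python
import random

def method3_selection(assignments, n_nodes, seed=0):
    # Sketch of selection Method 3: pick random SOM nodes, take one
    # pattern from each, and block each picked node's neighbours so
    # nothing more is selected nearby.  'assignments' maps each node
    # (i, j) to the patterns clustered to it; all names are illustrative.
    rng = random.Random(seed)
    nodes = [s for s in assignments if assignments[s]]
    rng.shuffle(nodes)
    blocked, chosen = set(), []
    for s in nodes:
        if len(chosen) == n_nodes or s in blocked:
            continue
        chosen.append(assignments[s][0])  # one pattern from this node
        i, j = s
        for di in (-1, 0, 1):             # block the 8-neighbourhood
            for dj in (-1, 0, 1):
                blocked.add((i + di, j + dj))
    return chosen

# A 4x4 map with one pattern per node, named after its node.
assignments = {(i, j): [f"p{i}{j}"] for i in range(4) for j in range(4)}
picked = method3_selection(assignments, n_nodes=3, seed=1)
```

The blocking step guarantees that any two selected nodes are at least two map cells apart, which is what spreads the labelled patterns across the input space.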
3.4 Experiments
GCC is applied to six existing datasets, since there were no established benchmarks for the
use of unlabelled data in classification. To explore different aspects of the GCC system
and its ability to use unlabelled data, these datasets are selected based on their different
statistical properties (distribution, distance relationship of the classes). In order to estimate
the accuracy of the system at each step, the chosen datasets are all labelled and divided
(randomly, with a similar distribution) into three portions: labelled, unlabelled and test
data. For the unlabelled data, the labels are set aside and never used during the process.
The number of labelled data is reduced to make the classification problem more difficult.
Several sets of experiments are performed on GCC. Each experiment has been designed
with objectives which will be addressed later in its corresponding section.
The experiments were as follows:

Experiment #1: explore the applicability of GCC on different datasets

Experiment #2: test the effectiveness of the Conservative and Alliance methods

Experiment #3: test the effectiveness of the second-order labelling and neighbouring
methods

Experiment #4: explore the GCC system performance on selected labelled data and
selection strategies
3.5 An Example
In order to highlight the problem and investigate the effect of the proposed solution and methods
in the previous sections, all the experiments have been executed on six benchmark datasets.
Each of these datasets represents a different statistical data space. This section presents
the results for the IRIS dataset (as an example) to clarify and ease the understanding of
the experiments.
The IRIS dataset (Fisher, 1936) is a well-known benchmark dataset that consists of four
characteristics of iris plants and classifies them into three classes of iris with 50 exemplars in
each class. One class is linearly separable from the other two, which are not linearly separable
from each other. BP networks can be trained to over 90% accuracy with 75 examples in the
training set and 75 examples in the test set. As was previously mentioned, the SOM offers
excellent visualization capabilities and techniques for comparing input data items. To analyze
the performance of the GCC system on the different datasets, the SOM is used to obtain
prior knowledge about the distribution of the data and the statistical relationship of the classes.
This knowledge can be obtained by training the SOM network with the training datasets and
plotting the classes separately. Figure 3.3 is a schematic representation of the IRIS data space.
For this dataset, the classes are clustered into different regions with few overlaps.
Figure 3.3: Iris Dataset Distribution
In Experiments 1, 2 and 3, the datasets are randomly (without using prior knowledge)
divided into labelled, unlabelled, and test datasets. During the experiments, all the labels in
the unlabelled data are discarded and not used during the procedure. However, it is important
to note that in the analysis stage of the GCC, prior knowledge about each node, each data
pattern and its true class is used to deal with them individually. This consideration helps
to collect more information about the types of errors that occur in the system. In addition,
this information is necessary to build a Confusion Matrix to estimate the accuracy and
performance of the system at each stage.
During the GCC procedure, first, the SOM is trained using either the Alliance or the
Conservative method. Then, during the labelling process, nodes are assigned a label. Un-
labelled data patterns that are mapped to labelling nodes are assigned a label and are
added to the labelled data to be used in the second-order relabelling process. Those unla-
belled data patterns that have been assigned to ambiguous nodes remain unclassified
and are passed to the next level for further processing. Tables 3.1 and 3.2 present ex-
amples of the confusion matrices that are used to evaluate the GCC system at each step. These
tables show system accuracy and the ability to produce self-labelled data, comparing the
Alliance and Conservative methods based on first-order labelling without using
the neighbouring procedure. Please note that the sizes of the datasets (labelled, unlabelled,
and test) and the network parameters for the experiments are given in the next chapter.
Table 3.1: Conservative method
Labelling ability = (15/66) = 23%
Labelling accuracy = (15/19) = 79%

Table 3.2: Alliance method
Labelling ability = (21/66) = 32%
Labelling accuracy = (21/23) = 91%
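Under one consistent reading of Tables 3.1 and 3.2 (labelling ability = correctly self-labelled patterns over all unlabelled patterns; labelling accuracy = correctly self-labelled over all self-labelled), the two metrics can be computed from a confusion matrix as sketched below. The per-class counts are illustrative, chosen only so the totals reproduce the Alliance method's reported figures.

```python
def labelling_metrics(confusion):
    # Labelling ability and accuracy from a confusion matrix laid out as
    # {true_class: {assigned_class_or_"undefined": count}}.
    # ability  = correctly self-labelled / all unlabelled patterns
    # accuracy = correctly self-labelled / all self-labelled patterns
    total = sum(sum(row.values()) for row in confusion.values())
    labelled = sum(n for row in confusion.values()
                   for a, n in row.items() if a != "undefined")
    correct = sum(row.get(c, 0) for c, row in confusion.items())
    ability = correct / total
    accuracy = correct / labelled if labelled else 0.0
    return ability, accuracy

# Illustrative per-class counts, chosen only so the totals reproduce the
# Alliance figures: 21 correct of 23 self-labelled, 66 patterns in all.
confusion = {
    1: {1: 8, "undefined": 14},
    2: {2: 7, 3: 1, "undefined": 14},
    3: {2: 1, 3: 6, "undefined": 15},
}
ability, accuracy = labelling_metrics(confusion)
```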
It is clear that some of the new self-labelled patterns have been mis-classified (for exam-
ple, in the Alliance method (Table 3.2), two unlabelled patterns are mis-classified). The data
patterns that belong to the undefined column will be passed to the second-order relabelling
stage. For each stage, the labelling accuracy and ability are calculated separately to fol-
low the performance of the system. The GCC results for the self-labelling procedure
are summarized in Figures 3.4 and 3.5. These figures present the results for Experiment
3; however, they are representative of the overall performance of the GCC system and of the
Alliance and Conservative methods.
Figure 3.4: Self-labelling accuracy

Figure 3.5: Self-labelling ability
A labelling accuracy of 79% in the Conservative method and 91% in the Alliance method demonstrates that the Alliance method reduces the number of mis-classifications (Figure 3.4). The same conclusion is true of the labelling ability of the GCC system (Figure 3.5). Moving from the initial stage, first-order-labelling, through the use of Neighbouring, second-order-labelling and Neighbouring, the system's ability to produce self-labelled data increases (23% to 59% to 87%). In all the steps, the Alliance performance is higher than the Conservative. These improvements are in aspects of both the quantity and the quality of the self-labelled data. These results are discussed further in the next chapter.
After assigning labels to unlabelled data to a desired degree, back-propagation is used to test the GCC approach by evaluating BP performance on the original labelled patterns, as well as on the newly added self-labelled data. The BP performance on the original labelled data is considered to be a baseline against which to make comparisons as the amount of training data is increased by progressively adding proportions of self-labelled data. In this section, BP is trained and tested four separate times on labelled data and proportions (randomly divided) of self-labelled data. The results are summarized in Figure 3.6. It is clear that by adding more self-labelled data, BP shows higher accuracy on the test data. For example, for the Alliance method with 9 labelled data (73%), test performance increases to 78%, 80%, 88% and 91% by adding self-labelled data.
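This evaluation loop can be sketched as follows. It is an illustrative stand-in, not the thesis code: a nearest-centroid classifier replaces the BP network and two-dimensional toy data replaces the IRIS split, but the procedure — train on the labelled baseline, then retrain with growing random proportions of self-labelled data and measure test accuracy each time — is the same.

```python
import random

random.seed(0)

def make_class(cx, cy, n):
    """n noisy 2-D points around centre (cx, cy)."""
    return [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)) for _ in range(n)]

def centroids(points, labels):
    sums = {}
    for (x, y), c in zip(points, labels):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict(cents, p):
    return min(cents, key=lambda c: (p[0] - cents[c][0]) ** 2 + (p[1] - cents[c][1]) ** 2)

def accuracy(cents, pts, labels):
    return sum(predict(cents, p) == y for p, y in zip(pts, labels)) / len(pts)

# Small labelled baseline, larger self-labelled pool, held-out test set.
labelled, lab_y, pool, test, test_y = [], [], [], [], []
for c, (cx, cy) in enumerate([(0, 0), (4, 0), (2, 4)]):
    labelled += make_class(cx, cy, 3); lab_y += [c] * 3
    pool += [(p, c) for p in make_class(cx, cy, 30)]
    test += make_class(cx, cy, 20); test_y += [c] * 20
random.shuffle(pool)

accs = []
for frac in (0.0, 0.25, 0.5, 1.0):        # proportion of self-labelled data added
    k = int(len(pool) * frac)
    cents = centroids(labelled + [p for p, _ in pool[:k]],
                      lab_y + [c for _, c in pool[:k]])
    accs.append(accuracy(cents, test, test_y))
    print(f"+{k:3d} self-labelled: test accuracy {accs[-1]:.2f}")
```

The first round (no self-labelled data) mirrors the 9-sample IRIS baseline; each later round retrains on the baseline plus a larger random slice of the self-labelled pool.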
Figure 3.6: BP performance on test data, IRIS dataset
In Experiment 4, the benchmark datasets are only divided into testing and training sets, and the labels for the training set are discarded. Then, the SOM is trained on the training data and the resulting clusters are used to select the labelled and unlabelled data. As was previously mentioned, selection can be done with different strategies (Methods 1, 2 and 3). When selection is complete, the true labels are assigned to the patterns to produce labelled data. The number of labelled data obtained in Method 1 is large (34) for the IRIS data compared to 9 in the previous experiments. It is important to note that one of the major goals in this experiment is to reduce the cost of producing labelled data by selecting the most effective, and at the same time smallest, amount of labelled data. The same methods as in Experiments 1 and 2 (Alliance, Neighbourhood, and Second-order-Relabelling) are applied to the resulting labelled data. A summary of BP performance on these datasets is given in Figure 3.7. Method 1 is a powerful technique which results in fewer mis-classified patterns. However, comparing the 34 labelled data obtained in Method 1 with 17 in Method 2 and 7 in Method 3, as well as the classification results presented in Figure 3.7 (92%, 92%, 90%), shows that the use of Method 1 for some of the datasets may not be necessary. The difference between the highest performance with 7 labelled patterns and 34 labelled patterns is 2%, which can be ignored when the cost of labelling is five times lower.
Figure 3.7: Testing performance (BP network) on selected IRIS labelled dataset
In conclusion, it can be said that the GCC system was capable of producing sufficient self-labelled data to improve the classification procedure for the IRIS data. The rest of the experiments will be discussed in the next chapter.
3.6 Conclusion
It would have been interesting to find a way to investigate to what degree the learned function is an accurate representation of the original function and, as a result of that, to examine the effectiveness of this approach and the validity of the labels assigned to unlabelled data. In practice, the effectiveness of this technique could not be evaluated based on mathematical concepts. Basically, the performance of the system is highly dependent on the clustering accuracy of the SOM for each particular dataset. Depending on the statistical relationship of the classes in each dataset, they might be clustered into separate parts of the SOM output space or with a high degree of overlap. As a result, there is a risk of incorrect label inference (mis-classification) for the unlabelled data. The probability of mis-classification decreases with the increasing number of labelled vectors clustered to each node (according to the law of large numbers).
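This law-of-large-numbers effect can be illustrated with a small simulation (purely illustrative, not from the thesis): if each labelled vector falling on a node carries the node's true class with probability p, the chance that a majority vote over n such vectors assigns the wrong label shrinks quickly as n grows.

```python
import random

random.seed(1)

def mislabel_rate(n, p=0.7, trials=4000):
    """Monte Carlo estimate of P(majority vote over n labelled vectors is wrong)."""
    wrong = 0
    for _ in range(trials):
        votes = sum(random.random() < p for _ in range(n))  # votes for the true class
        if 2 * votes <= n:                                  # minority or tie: wrong label
            wrong += 1
    return wrong / trials

for n in (1, 5, 15, 45):
    print(f"n={n:2d}: estimated mislabel probability {mislabel_rate(n):.3f}")
```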
The goal of this research as well as its problems were discussed in previous sections. In addition, a novel algorithm (GCC) and its theoretical and practical aspects were explored along with an example. The following chapter contains detailed information on the experiments, such as the size of the datasets, the network parameters and the rest of the results. In Chapter 5, advantages and disadvantages of this study and proposed solutions, mis-classification and system performance on different datasets will be analyzed.
Chapter 4
Experiments
4.1 Introduction
In the previous chapter, the theoretical aspects of the problem, proposed solutions and their objectives were examined. To validate these solutions, a number of experiments were executed on several benchmark datasets. Each experiment was designed with particular objectives. Some assumptions have been made during each simulation, and each dataset was chosen for a specific purpose. All these issues will be discussed in detail in this chapter. The empirical results for the IRIS dataset were presented in Chapter 3 to provide a general understanding of the simulations. The rest of the results and their analysis will be presented in this and the following chapters.
4.2 Data Collection and Description
To investigate the performance of the Labelling, Neighbouring and Re-labelling procedures, each dataset was selected in such a way that each one represents a different statistical characteristic. The resulting information can later be used to predict the performance of the GCC system on other datasets.
Data Description
Escherichia coli is one of the members of the coliform bacteria group normally found in human and animal intestines, and is indicative of fecal contamination when found in water. Determination of E. coli presence is often used to measure the microbiological safety of drinking water supplies. A dataset collected by researchers at Agriculture Canada and the University of Guelph consists of 228 samples of 13 inputs each, where the inputs are the results of such tests as "time to detection after 3 hours exposure to acid". In previous work (Stacey, 1998), an 85% accuracy was achieved for a BP network trained with 50% of the dataset and tested on the remaining data. This dataset serves as the basis for the second set of experiments.
The next dataset is Dr. William W. Wolberg's Wisconsin Breast Cancer data. This dataset contains 699 samples with 458 (65.5%) samples in the class Benign and 241 (34.5%) in the class Malignant. Each sample has 9 input features. Previous classification work by Zhang (Zhang, 1990) achieved 93.7% accuracy using only 200 instances for training a 1-nearest neighbour algorithm.
The Heart Disease dataset consists of 920 measurements of heart problem indicators collected from the Cleveland Clinic Foundation. Each measurement consists of 14 features. Other researchers (Detrano et al., 1989) have reported a 77% classification accuracy with a logistic-regression-derived discriminant function on this dataset when training on two thirds of the data and testing on the remaining third. There are two categories: no heart-disease present (509 exemplars = 55%) and heart-disease present (411 exemplars = 45%).
The Mushroom dataset consists of information about well-known mushrooms from the Audubon Field Guide provided by Schlimmer (Iba et al., 1988). This dataset contains 22 attributes per exemplar, with both quantitative measures as well as qualitative ones. There are two categories of mushrooms considered: definitely edible, and definitely poisonous or unknown. There are 8124 exemplars: 4208 (51.8%) edible, and 3916 (48.2%) inedible. Iba achieved approximately 95% classification accuracy on this set using 1000 instances for training their HILLARY algorithm.
The final dataset is the National Institute of Diabetes and Digestive and Kidney Diseases Database. It consists of 768 exemplars in two classes: "patient tested negative for diabetes" (500 exemplars, 65%), and "patient tested positive" (268 exemplars, 35%). There are 8 attributes used for prediction. Using 576 training examples, Smith et al. (Smith et al., 1988) achieved 76% accuracy, using their ADAP algorithm, on the remaining 192 instances.
Datasets Order Display
The SOM is an unsupervised tool for automatically arranging high-dimensional statistical datasets so that similar inputs are mapped close to each other. This map can be helpful in exploring a dataset by visualizing the data space, providing the desired information, and revealing surprising distance relations between different items of the datasets. This information may be used to predict:

How the Labelling, Neighbouring and Re-labelling procedures work.

The range of improvement in the classification procedure using the GCC approach.

Which strategy (discussed in the previous chapter) is appropriate for selecting the labelled data.

To ease the discussion and obtain a general view of the information contained in the datasets, the resulting maps of the trained SOM (on training data, both labelled and unlabelled) are presented here. Each map represents the data space and distances between classes.
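A stripped-down self-organizing map in pure Python shows the mechanism behind these maps (an illustrative sketch only — the function names, Gaussian neighbourhood and linearly decaying learning rate are this example's assumptions, not the thesis software's parameters): similar inputs win nearby grid nodes, so well-separated classes occupy different regions of the map.

```python
import math
import random

random.seed(2)

def best_node(w, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    return min(((i, j) for i in range(len(w)) for j in range(len(w[0]))),
               key=lambda ij: sum((a - b) ** 2 for a, b in zip(w[ij[0]][ij[1]], x)))

def train_som(data, rows, cols, epochs=30, lr0=0.5):
    dim = len(data[0])
    r0 = max(rows, cols) / 2
    w = [[[random.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)        # decaying learning rate
        rad = 1 + r0 * (1 - e / epochs)    # shrinking neighbourhood radius
        for x in data:
            bi, bj = best_node(w, x)
            for i in range(rows):
                for j in range(cols):
                    h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2) / (2 * rad * rad))
                    for k in range(dim):
                        w[i][j][k] += lr * h * (x[k] - w[i][j][k])
    return w

# Two well-separated toy clusters: after training they map to distinct regions.
cluster1 = [[random.gauss(0.2, 0.05), random.gauss(0.2, 0.05)] for _ in range(40)]
cluster2 = [[random.gauss(0.8, 0.05), random.gauss(0.8, 0.05)] for _ in range(40)]
som = train_som(cluster1 + cluster2, 6, 6)
nodes1 = {best_node(som, x) for x in cluster1}
nodes2 = {best_node(som, x) for x in cluster2}
print("overlapping nodes:", len(nodes1 & nodes2))
```

Plotting the winning nodes of each class, as in the figures below, makes the separability (or overlap) of the classes visible at a glance.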
Figure 4.1: E. coli data
Figure 4.2: Breast Cancer data
Figure 4.3: Mushroom data
Figure 4.4: Diabetes data
Figure 4.5: Heart Disease data
Differences between the datasets can be recognized by comparing Figures 4.1 to 4.5. For example, in Figures 4.2 and 4.3 (Breast Cancer and Mushroom), the classes are clustered into separate regions of the data space with the exception of a few overlaps. On the other hand, in the E. coli, Diabetes and Heart Disease cases (Figures 4.1, 4.4 and 4.5), the classes are hardly separable and have many overlaps. The use of this information will be pointed out in the appropriate sections that follow.
4.3 Experiments
Four sets of simulations were executed, each one with a different purpose. The objective of the first three experiments was to examine the GCC (Guelph Cluster Class) approach, its variations and their performance on different datasets. The last experiment focuses on labelled data selection and different strategies for selecting the most effective labelled data.
The selected datasets were all labelled. To run the experiments, they were divided into training and testing data. Training datasets were later partitioned into labelled and unlabelled data. All the selections were random, with similar distributions. For unlabelled data, the labels were set aside and never used in the process. Besides some parameters that were varied during each experiment, there were also parameters with fixed values that were chosen based on experience in the first experiment. For example, the labelled datasets selected for Experiments 2 and 3 were selected based on preliminary results obtained in Experiment 1. The major effort during the random selections in each experiment was to reduce the number of labelled data in such a way as to make the classification problem more difficult.

Evidence from Experiment 1 was useful in the selection of SOM and BP parameters during Experiments 2 and 3. SOM and BP parameter values were also set based on experiments in the literature. Since BP is a very popular neural network, guideline references for its parameters can be easily obtained by the reader. Detailed guidelines for how to actually choose SOM parameters are given in several publications by Kohonen and Kaski (for example (Kohonen, 1997)). They contain information on computing the proper maps. Topology preservation of the input space is quite difficult to define. Two different approaches for measuring the degree of topology preservation (by the SOM) are reviewed in (Kaski and Lagus, 1996). Prior knowledge about the data can be used in choosing the SOM features (Joutsiniemi et al., 1995). Scaling of the data is very important before applying the SOM algorithm ((Kohonen et al., 1996) and (Kaski and Kohonen, 1996)).
4.3.1 Experiment #1
This section covers detailed information about the size of the datasets, the SOM and BP parameters and the selection of labelled data, for several preliminary simulations that were carried out on the benchmark datasets. The main objectives of this experiment were i) to randomly select and reduce the number of labelled data to make the classification problem more difficult, and ii) to examine the overall performance of GCC on different datasets.

For each dataset, the data were randomly partitioned into training data and test data. Then, the training data were again randomly subdivided into labelled and unlabelled data. For the unlabelled data, all labels were discarded. The sizes of the labelled, unlabelled and test datasets are given in Table 4.1.
Table 4.1: Number of labelled, unlabelled, and test data

Data Sets       Labelled   Unlabelled   Test   Total
IRIS                   9           66     75     150
E. coli               32          106     90     228
Breast Cancer         50          389    260     699
Mushroom              90            -      -    8124
Diabetes              65          447    256     768
Heart Disease         65          547    308     920

Table 4.2: Size of classes in labelled and unlabelled data
All data for the labelled and unlabelled sets were selected randomly; however, a balanced number of samples from each class was maintained (Table 4.2). A BP network can perform very accurately even with a small number of training data if the data contains sufficient information. Consequently, a large number of labelled data were selected at the beginning of this experiment and they were gradually reduced to make the procedure harder. This reduction was continued to the point where the BP network was not able to perform very well with only the labelled data. It is important to note that since all the selections were random, this procedure was very tricky. The distribution, intraclass and interclass distances, and other statistical properties of the selected patterns were unpredictable, as was the GCC system performance. The resulting labelled datasets were used in Experiments 1, 2 and 3.
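The partitioning described above can be sketched as follows (a hypothetical helper, not the thesis code): each class is split in the same proportions, so the labelled, unlabelled and test sets all keep the balanced class composition the experiments rely on.

```python
import random

random.seed(3)

def stratified_split(samples, labels, fractions):
    """Split (samples, labels) into len(fractions) parts, class-balanced."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    parts = [([], []) for _ in fractions]
    for y, items in by_class.items():
        random.shuffle(items)
        start = 0
        for (part_x, part_y), frac in zip(parts, fractions):
            stop = start + int(round(frac * len(items)))
            part_x.extend(items[start:stop])
            part_y.extend([y] * (stop - start))
            start = stop
    return parts

# 150 samples, 50 per class, like IRIS; ~6% labelled, 44% unlabelled, 50% test
data, labels = list(range(150)), [i % 3 for i in range(150)]
(lab, lab_y), (unlab, _), (test, test_y) = stratified_split(data, labels,
                                                           [0.06, 0.44, 0.50])
print(len(lab), len(unlab), len(test))  # → 9 66 75
```

The labels of the middle partition are simply discarded to obtain the unlabelled pool.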
Selection of network parameters was based on running different simulations and on the existing literature. For each network, BP and SOM, parameters were varied for different datasets. However, they were fixed during the experiments on a specific dataset. These fixed parameters were chosen based on the optimal result from each network (Table 4.3). During the simulations, training and testing results were based on an average of four trials. Since no great variation in performance on particular datasets was observed, only four trials were conducted and efforts were directed to examining a variety of different datasets.
Similar to the example given in the previous chapter, a Confusion Matrix was used to calculate the accuracy of the procedure at each step. To examine whether the GCC procedure can improve classification, the resulting self-labelled data were randomly divided into different proportions (without using prior knowledge such as distributions, classes, etc.). Then, the training data were increased by adding these sets of self-labelled data. Since the baseline training data for BP is very small (the original labelled data), its training time was set to a small value (Table 4.3) in order to avoid over-training as the training data was gradually increasing.
Table 4.3: Network parameters

BP network
Datasets        L.R.    Input Num   Output Num   Hidden Num   Epoch
IRIS            0.1           4          3            3        400
E. coli         0.05         13          2            5        400
Breast Cancer   0.1           9          1            3        200
Mushroom        0.1         125          1            3        200

SOM: L.R. 0.1, dimension 8x4, radius 5, 10000 epochs
Results of this experiment demonstrate that the GCC procedure can successfully improve classification for datasets with different statistical properties. The E. coli dataset was one of the most challenging datasets used for experimentation. (Figure 4.1 demonstrates that the interclass distances between its items are small and the classes hardly separable.) 32 labelled samples from this dataset were selected as training data. As can be seen in Table 4.4, GCC was only able to achieve a maximum 63% ability to self-label, with an accuracy of 77% (20 mis-classified data). Thus it might be assumed that GCC was not going to be able to improve classification. However, on examination (see Figure 4.6), it can be seen that the BP network trained with the labelled data (32 samples) and the self-labelled data (a total of 118, with 20 mis-classified data) achieved a maximum classification accuracy of 73% on testing. The BP performance on the original labelled data was 57%.

Another example of a different dataset was the Mushroom dataset. 90 labelled samples were used for training. Because of the distribution of the two classes in this dataset (Figure 4.3), labelling ability and accuracy are very high, 96% and 99% (Table 4.5). The 99% labelling accuracy represents GCC's ability to obtain correct self-labelled data when using this kind of dataset. By gradually increasing the self-labelled data in training, the accuracy of the trained BP network jumped from 53% to 95%. These results are shown in Figure 4.7.
Table 4.4: Confusion matrix for unlabelled data (E. coli)

Labelling ability = (66/104) x 100 = 63%
Labelling accuracy = (66/86) x 100 = 77%

Class     1     2     undefined     Total
1        36     8         8          52
2        12    30        10          52

Table 4.5: Confusion matrix for unlabelled data (Mushroom)

Labelling ability = (5112/5303) x 100 = 96%
Labelling accuracy = (5112/5143) x 100 = 99%
GCC was capable of improving the classification procedure for all the datasets. Depending on the dataset, the range of improvement was different. For example, in Figures 4.6 and 4.7, it can be seen that the range of improvement for the Mushroom dataset was significantly larger than for the E. coli dataset. The same conclusion can be made about the range of improvement in the other datasets. Results for the rest of the datasets are presented in Appendix A.
Figure 4.6: GCC ability to improve classification of the E. coli dataset

Figure 4.7: GCC ability to improve classification of the Mushroom dataset
4.3.2 Experiment #2
It can now be said that the GCC approach is capable of using unlabelled data to improve supervised learning. However, this improvement is highly dependent on the quantity and the accuracy of the self-labelled data. The main objective of this experiment was to improve the quality and quantity of the self-labelled data, which hopefully would lead to an improvement in overall classification accuracy. This objective was tested by training the SOM using two different datasets: (i) a combination of labelled and unlabelled data (Alliance), (ii) only labelled data (Conservative). The sizes of the training sets and their classes are given in Tables 4.1 and 4.2 in the previous section. The same strategies used in Experiment 1 were used in this experiment. The Conservative technique's parameters were the same as those given in Table 4.3. For the Alliance technique, the network parameters are summarized in Table 4.6.

By using the Alliance method, it was expected that there would be an improvement in the accuracy of labelling, using information on both labelled and unlabelled data. Table 4.7 summarizes the results of the Alliance and Conservative methods for both the SOM and BP networks. As was expected (see the first two columns of Table 4.7), the SOM was able to increase its labelling accuracy using the Alliance procedure. For example, in the E. coli and Diabetes datasets the ranges of improvement were 8% and 9%. These improvements were in both the quantity and the accuracy of the produced self-labelled data. Please note that no great variations in the performance of each network were seen.
Table 4.6: Network parameters for the Alliance method

BP network
Datasets        L.R.    Input Num   Output Num   Hidden Num   Epoch
IRIS            0.2           4          3             3        400
E. coli         0.005        13          1             7       1000
Breast Cancer   0.2           9          1             8        150
Mushroom        0.2         125          1            35        150
Diabetes        0.01          8          1             3       1000
Heart Disease   0.01         13          1             4       1000

SOM
Datasets        L.R.    Dimension   Radius   Epoch
IRIS            0.1        8x4         5       4000
E. coli         0.1       20x15       18      10000
Breast Cancer   0.1       18x15       15       4000
Mushroom        0.1       30x25       30       1000
Diabetes        0.1       40x35       35      10000
Heart Disease   0.1       40x35       35       9000
Table 4.7: Alliance and Conservative results

                 SOM (labelling accuracy)      BP (test data)
Data Sets        Alliance    Conservative      Alliance    Conservative
IRIS               91%           87%             91%           89%
E. coli            60%           52%             73%           73%
Breast Cancer      97%           89%             97%           96%
Mushroom           96%           92%             98%           93%
Diabetes           57%           48%             75%           75%
Heart Disease      75%           68%             79%           80%

Despite this improvement in the labelling procedure, supervised learning accuracy did not increase (the third and fourth columns of Table 4.7 show BP performance on the test datasets). Although the self-labelling accuracy increased for all the datasets, the BP network's accuracy remained unchanged, except for the Mushroom dataset (Conservative 93% and Alliance 98%). The reason for this exception was the 233 correct patterns that were added as self-labelled data in the Alliance procedure. For the rest of the datasets, the improvement from the self-labelled data was not significant.

4.3.3 Experiment #3

As was previously discussed, the SOM has a tendency to assign similar data patterns to neighbouring nodes. Since the partitioning of the SOM nodes into labelling and non-labelling nodes results in a small proportion of the nodes being used for the labelling procedure, neighbours of the original nodes are considered as labelling nodes if sufficient evidence is available (the Neighbouring procedure). An alternative technique for increasing the quantity of the self-labelled data is re-labelling. This experiment was designed with the objective of testing the effect of the Neighbouring and re-labelling techniques on the GCC system. In this technique, the SOM is retrained using the original labelled data as well as the new self-labelled data. Please note that re-labelling is also referred to as second-order-labelling. In both methods, the probability of having mis-classified data will increase. In
the Neighbouring procedure, neighbours of the labelling node may contain data from a different class. Another problem is that mis-classified data resulting from first-order-labelling may mislead the re-labelling procedure. These problems, along with the effect of the techniques, were investigated by conducting the appropriate experiments on all the benchmark datasets. The sizes of the labelled and unlabelled datasets and the class composition in this experiment are given in Tables 4.1 and 4.2. The SOM and BP were trained with the same parameters as shown in Tables 4.3 and 4.6.
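A sketch of the Neighbouring step makes the risk concrete (the function name, vote threshold and agreement rule here are this sketch's assumptions, not the thesis implementation): a node with no labelled vectors of its own inherits a label only when its grid neighbours supply enough unanimous evidence, and any disagreement among neighbours leaves the node unlabelled.

```python
def neighbouring(node_labels, rows, cols, min_votes=2):
    """Extend {(i, j): class} over unlabelled grid nodes with enough evidence."""
    extended = dict(node_labels)
    for i in range(rows):
        for j in range(cols):
            if (i, j) in node_labels:
                continue
            votes = {}
            for di in (-1, 0, 1):               # 8-connected grid neighbourhood
                for dj in (-1, 0, 1):
                    lbl = node_labels.get((i + di, j + dj))
                    if lbl is not None:
                        votes[lbl] = votes.get(lbl, 0) + 1
            # sufficient evidence = enough votes and no competing class
            if votes and len(votes) == 1:
                (lbl, n), = votes.items()
                if n >= min_votes:
                    extended[(i, j)] = lbl
    return extended

labelling_nodes = {(0, 0): 'A', (0, 1): 'A', (3, 3): 'B', (3, 4): 'B'}
extended = neighbouring(labelling_nodes, 5, 5)
print(len(extended) - len(labelling_nodes), "nodes gained labels")
```

A mixed neighbourhood (votes for more than one class) leaves the node unlabelled, which is exactly where mis-classified first-order labels can do damage once re-labelling reuses them as evidence.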
In the initial stages of this experiment, the self-labelling process was performed without using the Neighbouring or Re-labelling techniques. These results were used as a baseline to evaluate the effect of these techniques on the self-labelling process. Comparisons were done for both the Conservative and Alliance methods. The training datasets used for the classification stage (BP) were the ones with the highest accuracy and largest number of self-labelled data obtained from these comparisons. For example, if the highest accuracy of self-labelled data belonged to the experiment using the Neighbouring and re-labelling techniques, then the resulting self-labelled data (from this simulation) were used to train and test the BP network. To summarize the results of this experiment, the Diabetes and Breast Cancer datasets have been selected as representatives of datasets with different statistical properties. SOM performance on the self-labelling procedure is presented in Figures 4.8 and 4.10. Figures 4.9 and 4.11 summarize the BP performance on the test data.
Figure 4.8: Labelling ability through different techniques on the Breast Cancer dataset
Figure 4.9: GCC ability to improve classification of the Breast Cancer dataset
Figure 4.10: Labelling ability through different techniques on the Diabetes dataset
Figure 4.11: GCC ability to improve classification of the Diabetes dataset
For the Breast Cancer dataset, 50 labelled samples were used for training. As the self-labelling process progressed from the initial stage, through the use of neighbouring alone, to the use of both the neighbouring and re-labelling methods (Figure 4.8), the GCC system was able to gradually increase the quantity of the self-labelled data (from 38% to 62% to 91% for the Conservative method and from 47% to 75% to 97% for the Alliance method). This self-labelling was extremely accurate (high 90s). When the BP network was trained with the labelled and self-labelled data, the accuracy of the trained network was comparable to systems trained with more data (Zhang used 200 samples compared to 50 samples here). These results are summarized in Figure 4.9.
The Diabetes dataset represents one of the most challenging problems for the GCC system. 65 labelled data were used for training. GCC was only able to achieve 60% ability in the self-labelling process (Figure 4.10). The BP network trained with only 65 samples and self-labelled data achieved a maximum classification accuracy of 75% on the test dataset (Figure 4.11). Since the selection of self-labelled data for the different training sets was random, the number of mis-classified data in each BP network's training set varied and was not controlled. A sudden drop in the BP network performance (from 70% to 67%) in Figure 4.11 was caused by the mis-classified patterns that had been selected for the training set. In conclusion, it can be said that even with some mis-classified self-labelled data, GCC is able to produce sufficient samples of properly classified self-labelled data to increase classification accuracy for different datasets.
4.3.4 Experiment #4
Most studies use both labelled and unlabelled data to improve supervised learning. The use of labelled data can be costly, since data samples have to be labelled manually. Thus, it is very important to select the most informative samples to be labelled. In the previous experiments, labelled data were randomly selected from an array of data patterns (all labelled) without using prior knowledge about the data or data space (there are problems encountered with this selection method, which will be discussed in the following chapter). Instead of randomly selecting the patterns, a different paradigm for data selection has been introduced: active learning. In active learning, the samples are selected in such a way that they are expected to be the most informative ones. The goals of active learning are i) to improve the efficiency of the patterns, with the aim of using fewer patterns and achieving higher performance, ii) to improve the cost efficiency of data acquisition by labelling only those data that are expected to be informative, and iii) to facilitate training by removing redundancy from the training set. In the following section a novel approach is discussed which encompasses the selection of labelled samples by using the information from the clustering stage to facilitate and increase the accuracy of the labelling procedure. This experiment was designed with the objective of avoiding the problems in Experiment 1, improving the quality and reducing the quantity of the labelled data.
In this approach, the training sets (Table 4.1) were used to train a SOM (all the labels of these training sets were discarded). Then, samples were selected based on labelling nodes (nodes to which at least one input vector has been assigned) by one of the following strategies.

Selecting one sample (randomly) from

all the labelling nodes, (Method 1)

nodes selected based on the density of the data patterns in the data space (giving a higher probability of selection to nodes in low-density neighbourhoods than to nodes in high-density regions), (Method 2)

randomly selected nodes (when a node was selected, its whole neighbourhood was blocked), (Method 3)
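Two of these strategies can be sketched around a single node-to-samples index (the names are this sketch's own, and Method 3's blocking rule is a simplified reading of the description above): Method 1 draws one pattern from every occupied node, while Method 3 visits nodes in random order and blocks the 8-connected neighbourhood of each selected node, shrinking the labelled set further.

```python
import random

random.seed(4)

def method1(node_of):
    """One randomly chosen sample from every occupied (labelling) node."""
    per_node = {}
    for sample, node in node_of.items():
        per_node.setdefault(node, []).append(sample)
    return sorted(random.choice(members) for members in per_node.values())

def method3(node_of):
    """One sample per randomly visited node; block its 8 grid neighbours."""
    per_node = {}
    for sample, node in node_of.items():
        per_node.setdefault(node, []).append(sample)
    nodes = list(per_node)
    random.shuffle(nodes)
    chosen, blocked = [], set()
    for (i, j) in nodes:
        if (i, j) in blocked:
            continue
        chosen.append(random.choice(per_node[(i, j)]))
        blocked.update((i + di, j + dj)
                       for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return sorted(chosen)

# sample id -> winning SOM node from the clustering stage (toy assignment)
node_of = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (2, 2), 4: (2, 2), 5: (4, 4)}
print("Method 1:", method1(node_of))   # one sample per occupied node
print("Method 3:", method3(node_of))   # fewer samples: neighbours are blocked
```

Method 2 would replace the uniform shuffle with density-weighted sampling over the nodes; only the chosen samples are then sent out for manual labelling.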
To obtain the initial labelled data, the selected unlabelled samples (from one of these strategies) had to be assigned their correct labels. The remaining unlabelled patterns were considered to be unlabelled data. To compare these strategies, the GCC system was examined on the resulting labelled datasets. Similarly to the previous experiment, the Alliance, Neighbouring and Re-labelling procedures were applied to obtain self-labelled data. Then, a BP network was trained and tested using the initial labelled data as well as the new self-labelled data. The sizes of the labelled and unlabelled datasets and the labelled sample classes are given in Table 4.8.
It is important to note that the objective of this experiment was to reduce the size of the labelled dataset without reducing the GCC system performance. When progressing through the selection strategies, from Method 1 to Method 3, the number of selected labelled data samples decreased at each step (Table 4.8). For example, in the Breast Cancer dataset, the labelled data size changed from 110 in Method 1 to 33 and 22 in Methods 2 and 3. Even though the labelled data obtained in Method 3 were five-fold smaller than the labelled data in Method 1, the GCC system was capable of achieving the same accuracy on the classification of the test data (a high of 95%) as in Method 1. The range of improvement from the lowest accuracy to the highest in Method 3 is much larger (64%) than in Method 1 (30%). Table 4.9 summarizes the results. In that table, for each Method, Column 1 gives the number of labelled data used in that method, and Columns 2 and 3 show the lowest (just the original labelled data) and the highest (labelled and self-labelled data) BP network performance on the test dataset. In the Diabetes dataset, the labelled data obtained in Method 3 is six-fold smaller than in Method 1, but at the same time, its accuracy (highest performance column) is only 7% lower. Figures 4.12 and 4.13 show the results for the Heart Disease and Mushroom datasets, which are very similar to the results discussed for the other datasets. In conclusion, selecting labelled data based on Method 1 may achieve higher accuracy; however, the cost of labelling has to be considered. On the other hand, Method 3 results in a much smaller dataset and reasonable accuracy with the GCC system.
l'able 4.8: Size of Labelled and Unlabelled data
1 Strotc~ies II Mcthod 1 1 Method 2 1 Mcthad 3 . 1 Data Sets
Q, B 1 1RlS
Labelled Class 1 1 Class 2
1 1 3 E. Coli Bread Cancer Mushroom
Unlabelled
3 1 68
Labellcd
25 57 4 8
Unlabelled
1 4 1 1
Class 3
13
Clzoa 1
7 3 5 53 52
Class 2
14
Labcllcd Ciam 1
3 78
329 6290
Clas 2 ( Chia 3
8 1 6 10 12 3 3
15 2 1 3 1
Table 4.9: Labelled data selection results. For each selection strategy (Methods 1 to 3) and each dataset (IRIS, E. Coli, Breast Cancer, Mushroom), the table lists the number of labelled data used, the lowest BP test accuracy (original labelled data only), and the highest BP test accuracy (labelled plus self-labelled data).
Figure 4.12: Testing performance (BP network) on selected Heart Disease labelled data

Figure 4.13: Testing performance (BP network) on selected Mushroom labelled data
Conclusion
This research has shown how a self-organizing map can be used to produce self-labelled
data in sufficient quantities and with sufficient accuracy to enhance the training of a BP
network.
In preliminary experiments, labelled and unlabelled datasets were randomly selected
based on the assumption that in real world applications, the practitioner has no control
over the labelled dataset. On the other hand, it was shown that it is possible to use a SOM
to seek out certain exemplars for labelling. If the labelling process is a costly one, then
information from the clustering stage can be used to assign labels to unlabelled samples in
the labelling process. In addition, to validate the approach, different datasets were tested.
Each dataset represented a different data characteristic.
In conclusion, it can be said that the proposed approach appears to have significant
merit for integrating unlabelled data into the domain of supervised learning. Specifically,
datasets that have reasonably separable classes show significant improvement from adding
unlabelled data.
Chapter 5
Analysis and Discussion
5.1 Introduction
Given the results from the previous chapter, what can be concluded about the behavior of
the GCC system? Clearly, the results show that GCC is capable of providing self-labelled data
with sufficient quantity and quality to improve supervised learning. However, there
are still many questions with respect to GCC performance. Why does the GCC system do
well? When does the GCC system do well? Could its behavior be generalized for different
datasets? If it can be generalized, what characteristics should a dataset have?
It is not an easy task to explain and generalize GCC system performance. Detailed
research, theoretical and empirical, is required to answer some of these questions.
This piece of work is basically a stepping stone (an introduction) to this field and
it does not concentrate on answering all the above questions, because of time limitations.
However, in this chapter, some of these questions are investigated through the performance
of the GCC system on the IRIS and E. coli datasets. The IRIS set is selected as a toy
problem since it is a popular and small benchmark dataset and easy to work with (Figure
5.1 shows the distribution of the IRIS data classes).
The key element in the GCC system that helps to improve supervised learning is the
use of unlabelled data in the procedure. In general, unlabelled patterns contain information
about the problem (Nigam et al., 1998). They provide information about the joint probability
distribution over the items in the data space. In other words, unlabelled data show the
true distribution of the patterns. For example, using only a small set of the IRIS dataset as
labelled data to train the SOM network might result in a map such as the one in Figure 5.2.
With the knowledge of three existing classes in this dataset, three major clusters can be
recognized on the map. Since there exist only three classes, the small cluster (represented
by "??") should belong to one of the major clusters. If the unknown cluster ("??") is
assigned based on its distance from the other clusters, then it belongs to Cluster
#3 (on the right hand side of the map). However, when unlabelled patterns are used in the
procedure (Figure 5.3), this estimate is contradicted. The distribution of the unlabelled data in
the data space (Figure 5.3) shows that there is a higher probability for the unknown cluster
to belong to Cluster #1 or #2. This simple example shows how unlabelled patterns can
help to avoid wrong estimates in clustering and, as a result, in the classification
procedure.
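The effect described above can be sketched numerically. In this hedged illustration (the data, the centroid estimator and the single refinement step are all hypothetical stand-ins, not the GCC algorithm itself), a handful of labelled points gives misleading class centres, and a pool of unlabelled points drawn from the true distribution pulls the estimates back toward the real cluster centres, changing how a borderline query is assigned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D example: two classes, but the few labelled points
# happen to sit at the edges of their true clusters.
labelled_x = np.array([[0.0, 0.0], [0.5, 0.0],   # class 0 (true centre near [1, 0])
                       [4.0, 0.0], [4.5, 0.0]])  # class 1 (true centre near [3, 0])
labelled_y = np.array([0, 0, 1, 1])

def centroids(x, y):
    return np.array([x[y == c].mean(axis=0) for c in (0, 1)])

def nearest(c, q):
    """Index of the centroid closest to query point q."""
    return int(np.argmin(np.linalg.norm(c - q, axis=1)))

query = np.array([2.1, 0.0])

# Using labelled data only, the query looks closer to class 0's centroid.
c_lab = centroids(labelled_x, labelled_y)
print(nearest(c_lab, query))  # class 0

# Unlabelled data drawn from the true distribution pulls the centroid
# estimates toward the real cluster centres (one k-means-style step).
unlab = np.vstack([rng.normal([1, 0], 0.3, (50, 2)),
                   rng.normal([3, 0], 0.3, (50, 2))])
pseudo = np.array([nearest(c_lab, u) for u in unlab])
all_x = np.vstack([labelled_x, unlab])
all_y = np.concatenate([labelled_y, pseudo])
c_all = centroids(all_x, all_y)
print(nearest(c_all, query))  # now class 1
```

The refined centroids move to roughly [1, 0] and [3, 0], so the decision boundary shifts and the same query point changes class.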
Figure 5.1: IRIS data ordered map (training data)
Figure 5.2: A sample labelled data distribution

Figure 5.3: A sample labelled and unlabelled data distribution
In addition to the distribution, other statistical characteristics of the dataset, such as intraclass
and interclass distances, play an important role in GCC system performance. On a
dataset where the intraclass distances are small and, at the same time, the interclass distances
are large, the performance of the GCC system and its variations improves significantly. Based on
the results in the previous chapter, selection on the labelled data and the Neighbouring and
Re-labelling techniques would be more efficient (a smaller number of labelled data and fewer
mis-classified patterns) on datasets with this type of distance relationship between their
classes.
In research by Castelli and his colleagues (Castelli and Cover, 1995), they show that
labelled samples are exponentially valuable in reducing the risk and error in the pattern
recognition field, whereas unlabelled data can only be polynomially valuable. They
prove that infinite unlabelled data, alone, can only be used to estimate the individual component
distributions; it cannot be used to construct a classification rule. With the use of a mixture
distribution function and an error model, they prove that the probability of error (in a
classification procedure) with no labelled data and infinite unlabelled data is 1/2. However,
when the number of labelled data points increases, the error reduces exponentially, and when
there is an infinite amount of labelled data, almost all the components of the mixture
distribution function can be recovered. The sizes of the labelled and unlabelled datasets should
be considered when using unlabelled data in a supervised learning procedure. In a problem
with infinite labelled data, unlabelled data does not aid the reduction of classification error.
If there is already a sufficient amount of labelled data, all the parameters can be retrieved
from just the labelled data and the resulting classifier is Bayes-optimal. The effect of labelled
and unlabelled dataset size in supervised learning has been discussed in detail in (Nigam
et al., 1998). The discussion on the performance of the GCC system and its variations will
continue in the following sections. Empirical results on the effect of labelled and unlabelled
dataset size and distribution are presented by an example.
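Castelli and Cover's point can be illustrated with a toy sketch (all numbers and the crude 2-means estimator below are hypothetical, chosen only to mirror the argument): unlabelled data alone can locate the mixture components, but mapping components to class labels requires at least a few labelled samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D mixture: two components whose means are unknown.
unlabelled = np.concatenate([rng.normal(-2, 0.5, 200),
                             rng.normal(2, 0.5, 200)])

# Unlabelled data alone can recover the component locations
# (here via a crude 2-means split)...
m = np.array([unlabelled.min(), unlabelled.max()])
for _ in range(10):
    assign = np.abs(unlabelled[:, None] - m).argmin(axis=1)
    m = np.array([unlabelled[assign == k].mean() for k in (0, 1)])
print(np.round(m, 1))  # component means recovered, close to [-2, 2]

# ...but it cannot say which component is which class: that mapping
# needs at least a few labelled samples.
labelled = {0: -1.8, 1: 2.1}  # one labelled point per class (assumed)
comp_to_class = {int(np.abs(m - x).argmin()): c for c, x in labelled.items()}

def classify(x):
    return comp_to_class[int(np.abs(m - x).argmin())]

print(classify(-2.2), classify(1.7))
```

Without the two labelled points, the component-to-class dictionary could not be built, which is exactly the sense in which unlabelled data alone cannot construct a classification rule.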
5.2 Analysis of Experiment #1
In general, using a small set of labelled training data for ciassification, accuracy will suffer
because of hi& variance in the parameter estimation procedure. aowever, it is important
t O note t hat given appropriate training data, BP is capable of approximating any functions
to satisfactory accuracy. As it was previously mentioned, one of the objectives for this
experiment was to select and d u c e the size of labelled data in such a way to make the
classification procedure harder. The labelled data was randomly selected fiom the training
data. The rem;rining data was considered as unlabelled data. Random selection of labelled
data resulted in different and unpredictable performances for self-labelling process and the
GCC system. This rinpredictability could be caused by: infinite labelled data, the distance
relation between labelled data items, and the labelled data distribution (if it is a reliable
representative of data space or not).
The size of the selected labelled data and its distribution were major problems during this
experiment. A very small data size did not necessarily make the classification procedure
harder. Sometimes, with a small set of labelled data, BP was capable of producing an
accurate classifier. On the other hand, a large number of labelled data that represented a
small region of the input space, or suffered from high variance, made the classification procedure
harder, to the point where even the GCC system could not be useful. This problem is even worse when
working with small datasets such as IRIS or E. coli. For example, the IRIS dataset contains
150 patterns, which was divided into 75 training and 75 testing patterns. The labelled data
was then selected from the 75 training samples. First, 25 patterns were selected as labelled data,
and BP accuracy was close to 90%. Then, the size of the labelled data was reduced to 15 and 7.
The following maps show the distribution of 15 and 7 labelled patterns in the data space.
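The split described here can be sketched as follows (a hedged outline with synthetic stand-in data; the real experiment used the actual IRIS features and the trained SOM/BP pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the 150-pattern IRIS set: features X, labels y.
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)

# 75/75 train-test split, as in the experiment.
perm = rng.permutation(150)
train, test = perm[:75], perm[75:]

def make_labelled_subset(train_idx, n_labelled):
    """Randomly keep n_labelled training patterns as labelled data;
    the remainder become the unlabelled pool (their labels are discarded)."""
    chosen = rng.permutation(train_idx)
    return chosen[:n_labelled], chosen[n_labelled:]

# The experiment's three labelled-set sizes: 25, then 15, then 7 patterns.
for n in (25, 15, 7):
    lab, unlab = make_labelled_subset(train, n)
    print(n, len(lab), len(unlab))
```

Because the subset is drawn at random, two runs with the same `n_labelled` can cover the input space very differently, which is the sensitivity discussed below.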
Figure 5.4: SOM map for 15 IRIS labelled patterns

Figure 5.5: SOM map for 7 IRIS labelled patterns
Considering the fact that there is a difference of only 8 patterns between the two datasets,
BP had an accuracy of 85% for 15 patterns (Figure 5.4) and 67% for 7 patterns (Figure
5.5). After using the GCC system, BP accuracy increased to 92% (for 15 patterns) and
89% (for 7 patterns). These examples show how sensitive the selection procedure can
be when it comes to small-sized datasets. This sensitivity was magnified when using the
different variants of the GCC procedure (Neighbouring, Re-labelling, etc.) and considering
the number of classes, their distance relation and distribution. On the other hand, the
Mushroom dataset was large, which made the selection procedure much easier. In this
dataset, 90 labelled patterns were selected from 5393 training data. The only consideration
was to make sure that the 90 patterns covered both classes.
Another objective of this experiment was to examine GCC system performance
on different types of datasets. The results for each dataset (in the previous chapter) show
that the GCC system is capable of producing sufficient self-labelled data with reasonable
accuracy and, consequently, of improving classification for all the datasets. However, the range
of improvement for each type varies. For example, the improvement in classification in the
Breast Cancer and Mushroom cases is larger than in the E. coli or Diabetes case. In addition,
in the Breast Cancer and Mushroom cases, the size of the SOM map as well as the labelled
data can be reduced (as shown in Experiment 4).
As was previously mentioned, the self-labelling procedure does not always result in correctly
labelled data. Unlabelled patterns may be mis-classified due to overlaps in the SOM's nodes
or small interclass distances in the datasets. The proportion of mis-classified data (obtained
by the GCC system) varies in each dataset based on its statistical characteristics. In
all the experiments, GCC system performance was tested by training the BP network on
different portions of the self-labelled data. The E. coli set has been chosen as an example
to investigate the effect of mis-classified data on the classification procedure. Instead of using
random selection for different portions of self-labelled data, a specific number of mis-classified
patterns was selected with each portion. Figure 5.6 shows a sudden drop (from 70%
to 46%) in BP network performance after adding 25 mis-classified patterns to the
second portion of self-labelled data. Please note that, in each step, the number of correct
self-labelled data was increasing (although this is not shown on the graph). By increasing the
size of the correct self-labelled data, BP network performance improved (from 46% to 69%) in
the last step.
Figure 5.6: Number of mis-classified data versus BP network performance on the test data
5.2.1 Analysis of Experiment #4
This experiment was designed with the objective of resolving problems that were encountered
in experiment one. In the real world, problems related to the selection of labelled
data can be costly, since labelling is a costly procedure. It is important to select the most
effective unlabelled patterns, and the smallest number of them, to reduce the cost of active learning. In
order to achieve this goal, the SOM nodes have been used for the selection procedure. Three
strategies were designed for the selection, each one with different properties (results are
presented in the previous chapter).
The first strategy is the most powerful and effective of the three strategies and
provides a large number of labelled data. However, the resulting labelled set would be a
large, effectively infinite one (Castelli and Cover, 1995), which would violate the objective of this
experiment. Moreover, with a large labelled dataset, unlabelled data does not aid
in the reduction of classification error, and the use of procedures such as GCC would be out
of the question. With the use of the second and third strategies, the selected labelled data
can be reduced to a desirable degree. The second strategy is very effective and comparable
with the first strategy. Because of the randomness in the third strategy, it is not
as efficient as the other two. However, the number of selected patterns can be exactly
specified.
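A minimal sketch of node-based selection may clarify the idea (the exact strategies are defined earlier in the thesis; the nearest-pattern rule and the random node subsample below are only illustrative assumptions, not the thesis's actual strategies):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical trained SOM: a 4x4 grid of weight vectors in feature space.
weights = rng.normal(size=(16, 4))
X = rng.normal(size=(200, 4))  # pool of unlabelled patterns

def nearest_pattern(w, pool):
    """Index of the pool pattern closest to a node's weight vector."""
    return int(np.linalg.norm(pool - w, axis=1).argmin())

# One candidate per node: the pool pattern nearest each weight vector,
# so the candidates jointly cover the regions the map has learned.
candidates = sorted({nearest_pattern(w, X) for w in weights})

# Sampling a fixed number of candidates caps the labelling budget, which
# mirrors the third strategy's property that the count is exactly specifiable.
budget = 5
ask_expert = rng.choice(candidates, size=budget, replace=False)
print(len(candidates), len(ask_expert))
```

The trade-off in the text shows up directly: taking every candidate maximizes coverage but the labelling cost grows with the map size, while the random subsample gives an exact, small budget at the price of some efficiency.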
5.3 Analysis of Experiment #2
The objective of this experiment was to improve self-labelled data quality. The SOM net-
work was trained using training data (unlabelled and labelled data) instead of simply using
labelled data. Since unlabelled data provides idormation about the true distribution of
the data patterns (the IRIS data example in introduction section), the expectation was to
observe improvement in both seK-labelling and classification procedure for all the datasets-
Table 5.1 siimmarizes the results for the self-labelling procedure in the Alliance and Con-
servatiue methods. The first col-!imn for each method is the size of undefinecl patterns that
were clustered to undefined nodes. The second colwnn shows the number of mis-classified
patterns and the Iast one is the labelling accuracy. For al1 the datasets, labelling accu-
racy was increased with the Alliance method. In the E. coli, Diabetes and Head Diseuse
cases (Table 5.1), improvernents were produced by reducing mis-classsed data instead of
producing new self-labelled data. In the O ther dat asets, labelling accuracy was increased
due to improvement in quantity of self-labelled data (reducing the number of mis-classi6ed
data). This happens due to the information that the unlabelled data provides during the
procedure. (Please note that the results presented in Table 5.1 are based on Neighbouring
and Re-la belling techniques as well.)
Table 5.1: Self-labelling accuracy in the Alliance and Conservative methods. For each training dataset (IRIS, E. Coli, Breast Cancer, Mushroom, Heart Disease, Diabetes) and each method, the table lists the number of undefined patterns, the number of mis-classified patterns, and the labelling accuracy.
Despite this expectation, BP performance on the training data produced by the Alliance and
Conservative methods was very close for all but the Mushroom dataset. In
the Mushroom case, BP performance on the test data increases from 93% in the Conservative
method to 98% in the Alliance method. By comparing the number of mis-classified patterns
in the Conservative and Alliance methods (241 to 31) and the newly added self-labelled data
(23), 233 correct patterns were added as self-labelled data in the Alliance procedure. It
is important to note that the 241 mis-classified patterns (Conservative procedure) could have
misled the BP network. That amount of mis-classified data was later reduced to 31 in the
Alliance procedure. This example, once again, validates the assumption of finite labelled
data and infinite unlabelled data and its effect on the reduction of classification error. The
Mushroom set was divided into 90 labelled and 5303 unlabelled data. It is evident that
5303 unlabelled patterns can provide sufficient information to improve labelling accuracy.
In general, it can be said that the Alliance method improves the self-labelling procedure
and should be used in the GCC procedure.
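The difference between the two methods comes down to which data the SOM sees during training. The sketch below is a hedged illustration (a minimal 1-D-grid SOM with made-up parameters and data, not the thesis's actual network): a map trained on labelled data alone never covers regions represented only by unlabelled patterns, while a map trained on both does:

```python
import numpy as np

rng = np.random.default_rng(7)

def train_som(data, n_nodes=10, epochs=30, lr=0.5, radius=2.0):
    """Minimal 1-D SOM: nodes on a line, Gaussian neighbourhood,
    learning rate and radius decaying over the epochs."""
    w = data[rng.choice(len(data), n_nodes)].astype(float)
    grid = np.arange(n_nodes)
    for e in range(epochs):
        a, r = lr * (1 - e / epochs), radius * (1 - e / epochs) + 0.5
        for x in data[rng.permutation(len(data))]:
            bmu = np.linalg.norm(w - x, axis=1).argmin()
            h = np.exp(-((grid - bmu) ** 2) / (2 * r * r))
            w += a * h[:, None] * (x - w)
    return w

# Hypothetical data: labelled points cover only the first cluster,
# while a second cluster exists only in the unlabelled pool.
labelled = rng.normal([0, 0], 0.4, (10, 2))
unlabelled = np.vstack([rng.normal([0, 0], 0.4, (90, 2)),
                        rng.normal([4, 4], 0.4, (100, 2))])

w_conservative = train_som(labelled)                       # labelled only
w_alliance = train_som(np.vstack([labelled, unlabelled]))  # both

# Only the Alliance-style map places nodes near the unlabelled-only cluster.
near2 = lambda w: int((np.linalg.norm(w - [4, 4], axis=1) < 1).sum())
print(near2(w_conservative), near2(w_alliance))
```

Nodes that never land near the second cluster can only produce undefined or wrongly labelled patterns there, which is consistent with the Mushroom result above.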
5.4 Analysis of Experiment #3
In addition to the Alliance and Conservative methods, the Neighbouring and Re-labelling techniques
were other variants of the GCC procedure that were used to increase the self-labelling
ability of the system. Empirical results for this experiment illustrated that the combination
of the Alliance, Neighbouring and Re-labelling techniques could improve self-labelled data
to a sufficient degree (previous chapter). However, each technique's performance varies depending
on the statistical characteristics or the size of the dataset. The self-labelled data
resulting from the Neighbouring method is highly dependent on the intraclass and interclass
distances. If the dataset's interclass distances are not sufficiently large, the self-labelling
process will not be accurate and may result in a large number of mis-classified self-labelled
patterns. Mis-classified data may later mislead second-order labelling and the classification
procedure (BP network). In Table 5.1, the results in the mis-classified columns demonstrate
that the datasets with small interclass distances (E. coli, Diabetes and Heart
Disease) have larger numbers of mis-classified data than the other ones (considering the size
of the dataset). By investigating the maps for each of these datasets (4.1, 4.4 and 4.5),
it can be said that the role of the initial labelled data in controlling self-labelling accuracy in
the Neighbouring technique is very crucial. Large SOM maps can result in a lower number
of mis-classified data and a larger number of undefined data, which may reduce the risk
of misleading the classification procedure. Despite the discussed problems, the results for this
experiment show that the Neighbouring method is capable of improving self-labelled data
quality and quantity. It is important to note that all the selections (labelled and unlabelled
data) during this experiment were random, without using prior knowledge. This fact
confirms the positive effect of the Neighbouring method on the self-labelling procedure.
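As a rough sketch of this behaviour (the grid, the labels and the unanimity rule below are hypothetical simplifications, not the thesis's actual Neighbouring and Re-labelling procedures), labels spread outward from the initially labelled nodes, and nodes caught between conflicting fronts stay undefined:

```python
import numpy as np

# Hypothetical 5x5 SOM grid: 0 = undefined node, 1/2 = class labels
# assigned from the initial labelled data.
grid = np.zeros((5, 5), dtype=int)
grid[0, 0] = grid[1, 1] = 1
grid[4, 4] = grid[3, 4] = 2

def neighbour_pass(g):
    """One Neighbouring-style pass: an undefined node adopts a label only
    if all of its labelled 4-neighbours agree (conflicts stay undefined)."""
    out = g.copy()
    rows, cols = g.shape
    for i in range(rows):
        for j in range(cols):
            if g[i, j] != 0:
                continue
            labs = {g[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < rows and 0 <= b < cols and g[a, b] != 0}
            if len(labs) == 1:
                out[i, j] = labs.pop()
    return out

# Re-labelling as repetition: apply the pass until no node changes.
prev = grid
while True:
    nxt = neighbour_pass(prev)
    if (nxt == prev).all():
        break
    prev = nxt
print((prev == 0).sum())  # undefined nodes remaining at the conflicting front
```

With small interclass distances the two fronts meet early and the conflicting band is wide, which is the mechanism behind the larger mis-classified counts reported for E. coli, Diabetes and Heart Disease.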
Re-labelling is a very useful method; however, one might wonder how many times this
method should be applied during one procedure. At this time, it is not possible to provide
theoretical evidence on the number of Re-labelling passes that should be taken for self-labelling.
Of course, Re-labelling can be extended to arbitrary depths, although typically
one will reach a point of diminishing returns where no labels can be assigned to the undefined
data (unlabelled data that have not been assigned labels).
The ideal situation for the Neighbouring and Re-labelling methods is a dataset with
large interclass and small intraclass distances. In this situation, each round of the labelling
procedure will result in a large number of correct self-labelled data. Then, in the Re-labelling
procedure, these data patterns and their neighbouring nodes will be used to obtain even
more correct self-labelled data. If this ideal condition on the class distances is not met,
then the number of mis-classified patterns produced will be large. Analyzing the GCC
system performance step by step will result in constructing an error model that can be used
as evidence to terminate the labelling procedure. One of the important assumptions in the
GCC procedure is that the probability of error in the original labelled data is zero. (As an
important side note, this assumption does not have to be correct.) By proceeding through
the Alliance and Conservative methods, some of the unlabelled patterns are mis-classified
and are added to the original labelled data to be used again. Other than these two strategies,
the Neighbouring technique can result in mis-classified patterns, especially in datasets such
as E. coli, Diabetes, or Heart Disease, where classes are scattered all over the data space
with no specific order and close to each other. Consequently, these newly added labelled
data (with many mis-classified items) along with the original labelled data will be used for
the Re-labelling procedure. Starting again from the first stage (with the difference of having a
large number of incorrect data patterns in the labelled data) can result in even more mis-classified
data. Therefore, over-application of the self-labelling procedure may jeopardize
the accuracy of the labelled data and has to be carefully considered. When the number of
undefined data patterns remains unchanged, the Re-labelling procedure has to be halted.
As an example, E. coli was selected to investigate the over-labelling problem. Two
different labelled datasets were used for the training and self-labelling process. One was the same
labelled data (32 patterns, 16 for each class (refer to the previous chapter)) used in previous
experiments, and the other was 32 randomly selected samples, 8 from class O157 and
24 from class non-O157. For each dataset, the Re-labelling procedure was applied 8 times.
Figure 5.7 demonstrates the labelling accuracy of these datasets, where

Labelling accuracy = (number of correct self-labelled data) / (number of self-labelled data obtained).

For a balanced number of classes, the labelling accuracy remained essentially unchanged (with
minor fluctuations); however, for an unbalanced number of classes, the labelling accuracy decreased
drastically. This happened because of the increase in the number of mis-classified
data.
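The accuracy measure above is straightforward to compute; a small hedged helper makes the definition concrete (the function name and the None-for-still-unlabelled convention are assumptions of this sketch, not the thesis's implementation):

```python
def labelling_accuracy(self_labels, true_labels):
    """Labelling accuracy = correct self-labelled / total self-labelled,
    counting only patterns that actually received a label (not None)."""
    scored = [(s, t) for s, t in zip(self_labels, true_labels) if s is not None]
    if not scored:
        return 0.0
    return sum(s == t for s, t in scored) / len(scored)

# Five patterns, one still unlabelled; three of the four assigned labels
# are correct, so the accuracy is 3/4.
print(labelling_accuracy([0, 1, None, 1, 0], [0, 1, 1, 0, 0]))  # 0.75
```

Note that patterns left undefined do not count against the accuracy; they only reduce the quantity of self-labelled data.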
Figure 5.7: Test on E. coli: re-labelling accuracy versus the number of re-labelling passes, for balanced and unbalanced class distributions in the labelled data
5.5 Conclusion
Ideally, the effort is to set specific conditions and to generalize system performance based
on those conditions. Generalization based on empirical results may not be applicable or
accepted without theoretical proof. However, it is difficult to assess the quality
of the approach based on theoretical arguments alone. This study does not concentrate on
the theoretical aspects of the approach or experiments, because of time limitations. It is basically
an introduction to a new technique for the use of unlabelled data in classification. More
detailed study is needed to establish a theoretical basis for the current approach and its
performance.
In this chapter, some of the questions related to the GCC system and its variations were
examined with an example. More detailed study still needs to be done. The major problems
during the experiments were those related to the randomly selected labelled datasets
(their distribution) and the size of the labelled versus unlabelled datasets. The good
news is that the majority of real-world problems deal with a large number of unlabelled
data. In addition, the number of classes, their distributions, and other prior knowledge
about the datasets are often available. With the use of Experiment 4 (to select labelled
data) and the large number of unlabelled data available, the GCC system seems to be a
promising approach.
Chapter 6
Conclusions
The well known problem of insufficient labelled data is still an open question in any field
of supervised learning where processes are faced with unsatisfactory performance and the high
cost of labelling training patterns. Various attempts have been made to overcome this
problem. Much of the research, such as co-training (Blum and Mitchell, 1998), or the
combination of Expectation Maximization with other classifiers or hierarchical shrinkage
(McCallum and Nigam, 1999), focuses on the use of unlabelled data to improve classification
in the text problem domain. This thesis has presented a simple, yet novel technique (Guelph
Cluster Class (GCC)) which shares some similarities with other approaches to the use of
unlabelled data. However, GCC is based on neural networks and has been tested on sets of
real-world problems other than text problems.
This work is concerned with a relevant neural classification problem (introduced in
Chapter 2), often encountered in practice, where the amount of labelled data is insufficient for
training, validating and testing a neural network. The GCC algorithm has been described
along with its implementation and practical applications. The novelty of this approach lies
in applying a Self-Organizing Map for clustering and using the resulting clusters to
assign labels to unlabelled data (the self-labelling process). The ability of the SOM to provide
information about the structure of the labelled dataset was tested using the Alliance and
Conservative methods. The SOM has the tendency to cluster similar input vectors in the
same topological neighbourhood, close to each other. This thesis has shown how to apply the
neighbouring nodes in the GCC system (the Neighbouring approach) to produce self-labelled
data in sufficient quantities and with sufficient accuracy. In addition, the GCC system used
the resulting self-labelled data to assign labels to unlabelled data (the Re-labelling method) and
obtain more labelled data. In the classification stage, the self-labelled data was applied to
enhance the training of a BP network. Comparison of the results was based on BP network
performance on test data using the original labelled data for training. To evaluate the
GCC system, the selected datasets (six real-world benchmark datasets) were statistically
different. The results of these experiments (1, 2 and 3) showed that the GCC approach has
significant merit for integrating unlabelled data into the domain of supervised learning.
In the last experiment, the clusters produced by the SOM were used to help select the
most effective samples of unlabelled data, which should be labelled by an expert. Since the
labelling process (of the original labelled data) is a costly exercise, information from the
clustering stage could direct attention to those unlabelled samples which would be most useful
in the self-labelling process if their true labels were known. In this last experiment, the
original labelled data was reduced to a significantly smaller size than in the previous experiments,
although GCC performance remained the same for most of the datasets.
Similar to other approaches, there are some potential caveats in the use of the GCC
approach. Because of the random nature of the experiments and its sensitivity to the initial
data, it is hard to predict GCC performance. It is not exactly clear how to control the
number of mis-classified data and how they would affect the classification procedure. However,
the GCC system has been shown to be a reliable technique for improving supervised learning.
It can work with a very small set of labelled data (properly selected) and thus reduce
cost. GCC is a flexible technique that can be used on different datasets. It is an easy
technique for integrating unlabelled data, compared to other approaches in which some
essential assumptions have to be made which may not always hold (e.g. (Blum and
Mitchell, 1998) and (McCallum and Nigam, 1999)).
6.1 Future Work
The studies described in this thesis layout the foundation for future researches. Further
more, they help open doors into new applicationst Yet, more detailed researches need to be
done in order to provide more insights into the 1:echnique.
Indeed, it is difficult to assess the quality of the introduced approach based on theoret-
i d arguments. However, some more theoretical background and more solid argumen-
tation would make the work stronger. This effort will be used in the selection of GCC
system features such as the SOM and labelling process (parameters and methods).
A more detailed investigation of statistical c.haracteristics of datasets would make the
seiection of GCC system parameters and methods fsuch as: Allzaace or Conservatiue,
Neighbouring, and re-labelling) much easier. It may answer some of the questions
(caused by the random nature of this study) and eliminate unpredictability in the
syst em's performance.
Usage of other clustering technique with similar algorithm can result in interesthg
outcornes. Other clustering technique may outperform the SOM self-labelling process
on some of the datasets.
Involving fuzzy logic theory in the labelling process may reduce the number of mis-
classified data. It is important to have control over mis-classified patterns, since they
rnay mis-lead the re-lcrbelling process and as a result the classification procedure.
Exploring other techniques and dgorithirw for labelled data selection (active leaming)
to achieve: i) the smallest number of labelled data (to reduce the cost), ii) the most
effective labelled data (to increase self-labelled data quantity and accuracy).
Investigating GCC system's performance on the datasets with more than three classes
that are nonlinearly separable.
Appendix A
Experimental Results
A.l Experiments 1, 2 & 3: Labelling Procedure
Figure A.1: Labelling Accuracy, IRIS dataset

Figure A.2: Labelling Ability, IRIS dataset
Figure A.3: Labelling Accuracy, E. coli dataset

Figure A.4: Labelling Ability, E. coli dataset

Figure A.5: Labelling Accuracy, Breast Cancer dataset

Figure A.6: Labelling Ability, Breast Cancer dataset

Figure A.7: Labelling Accuracy, Mushroom dataset

Figure A.8: Labelling Ability, Mushroom dataset
Figure A.9: Labelling Accuracy, Diabetes dataset

Figure A.10: Labelling Ability, Diabetes dataset

Figure A.11: Labelling Accuracy, Heart Disease dataset

Figure A.12: Labelling Ability, Heart Disease dataset
A.2 Experiments 1, 2 & 3: Classification stage (Results on
test data)
95 1 I I 1 1 I
Allinace - Com-üve -------
90 -
85 - 7' a? 81 .-.--
80 - / :.- A --------- ---------
O 't __--- a
75 - .-
73 -
LabelIed+Setf-labeiled data
Figure A. 13: BP performance on the IRIS dataset
l I I 1 1
Consetvative - Alliance
74 73
Labelled+Self-labelled data
Figure A.14: BP performance on the E. colé dataset
LabelledtSeM-labelled data
Figure A.15: BP performance on the Breast Cancer dataset
Figure A.16: BP performance on the Mushroom dataset
Figure A.17: BP performance on the Diabetes dataset
Figure A.18: BP performance on the Heart Disease dataset
A.3 Experiment 4: Classification stage (Results on test data)
Figure A.19: BP performance on selected IRIS labelled data
Figure A.20: BP performance on selected E. coli labelled data
Figure A.21: BP performance on selected Breast Cancer labelled data
Figure A.22: BP performance on selected Mushroom labelled data
Figure A.23: BP performance on selected Diabetes labelled data
Figure A.24: BP performance on selected Heart Disease labelled data
Bibliography
Baum, E.B., and Haussler, D., What Size Net Gives Valid Generalization?. Neural Computation, 1, 151-160, 1989.
Blum, A., and Mitchell, T., Combining Labeled and Unlabeled Data with Co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.
Castelli, V., and Cover, T.M., On the Exponential Value of Labeled Samples. Pattern Recognition Letters, 16, 105-111, 1995.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., and Froelicher, V., International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310, 1989.
Fisher, R., The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 2, 179-188, 1936.
Haykin, S., Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, New Jersey, 1999.
Iba, W., Wogulis, J., and Langley, P., Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79, Ann Arbor, Michigan: Morgan Kaufmann.
Joutsiniemi, S.-L., Kaski, S., and Larsen, T.A., Self-organizing map in recognition of topographic patterns of EEG spectra. IEEE Transactions on Biomedical Engineering, 42, 1062-1068, 1995.
Kaski, S., and Kohonen, T., Exploratory data analysis by the Self-organizing map: Structures of welfare and poverty in the world. Neural Networks in Financial Engineering. Proceedings of the Third International Conference on Neural Networks, 498-507, 1996.
Kaski, S., and Lagus, K., Comparing self-organizing maps. Proceedings of International Conference on Artificial Neural Networks, 1112, 809-814, 1996.
Kohonen, T., Kaski, S., Lagus, K., and Honkela, T., Very large two-level SOM for the browsing of newsgroups. Proceedings of International Conference on Artificial Neural Networks, 1112, 269-274, 1996.
Kohonen, T., Self-Organizing Maps. Springer, 1997.
Muller, K., Finke, M., Schulten, K., Murata, N., and Amari, S., A Numerical Study on Learning Curves in Stochastic Multi-Layer Feed-Forward Networks. Neural Computation, 8, 1085-1106, 1996.
Nigam, K., and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training. Ninth International Conference on Information and Knowledge Management, 2000.
Nigam, K., McCallum, A., Thrun, S., and Mitchell, T., Learning to Classify Text from Labeled and Unlabeled Documents. American Association for Artificial Intelligence, 1998.
McCallum, A., and Nigam, K., Text Classification by Bootstrapping with Keywords, EM and Shrinkage. ACL '99 Workshop for Unsupervised Learning in Natural Language Processing, 1999.
Schuurmans, D., A New Metric-Based Approach to Model Selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI, July 1997.
Shahshahani, B.M., and Landgrebe, D.A., The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 5, 1994.
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., and Johannes, R.S., Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Symposium on Computer Applications and Medical Care, 261-265, 1988, IEEE Computer Society Press.
Stacey, D.A., Preliminary Artificial Neural Network Analysis of E. coli Data. Unpublished Report, Nov, 1998.
de Sa, V.R., Learning Classification with Unlabeled Data. Advances in Neural Information Processing Systems, 6, 112-119, 1994.
Zhang, J., Selecting typical instances in instance-based learning. Proceedings of the Ninth International Machine Learning Conference, 470-479, 1992, Aberdeen, Scotland: Morgan Kaufmann.