THE USE OF UNLABELLED DATA FOR SUPERVISED LEARNING
A Thesis
Presented to
The Faculty of Graduate Studies
of
The University of Guelph
by
ROZITA DARA
In partial fulfilment of requirements
for the degree of
Master of Science
August, 2001
© Rozita Dara, 2001
National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
THE USE OF UNLABELLED DATA FOR SUPERVISED LEARNING
Rozita Dara University of Guelph, 2001
Advisors: Professor D. A. Stacey and Professor S. C. Kremer

When provided with enough labelled training examples, a supervised learning algorithm can learn reasonably accurately. However, creating sufficient labelled data to train accurate classifiers is time consuming and expensive. On the other hand, unlabelled data is usually easy to obtain. This research introduces a novel approach, Guelph Cluster Class (GCC), which improves the task of classification with the use of unlabelled data. The novelty of this approach lies in the use of an unsupervised network, the Self-Organizing Map, to select natural clusters in labelled and unlabelled data. Subclasses (made by labelled data) are used to assign labels to unlabelled patterns to produce self-labelled data. The performance of several variants of the GCC system has been obtained by running a Back-Propagation network on labelled and self-labelled data. Results of experiments on several benchmark datasets demonstrate an increasing power for the classification procedure even when the number of labelled data is very small.
Acknowledgements

I am indebted to my advisors:

Dr. Deborah Stacey for her support, mentorship and guidance in the course of my studies in Guelph. The extent and depth of her intelligence have never failed to inspire me. Deb's honest concern for students and her pleasant disposition made my interaction with her a truly rewarding experience.

Dr. Stefan Kremer who generously provided me with encouragement, support and creative insight. His enthusiasm for science has laid the groundwork for my future research career in science.

I would also like to thank my advisory committee member, Dr. David Calvert, for directing me throughout my research.

I am quite fortunate to have been surrounded by caring and funny friends and colleagues (Rami Zeineh, Narendra Pershad, Sudip Biswas, Orlando Cicchello, Saira Ahmad, Neil Harvey and Ramin Farshad Tabrizi) who made the time that I spent in Guelph enjoyable.

This thesis is dedicated to my husband, Shayan Sharif, who provided endless supplies of encouragement and advice. He kept me motivated through the tough times and this thesis would not be possible without his generous efforts. I am greatly thankful to him.
Contents

1 Introduction
2 Literature Review 5
  2.1 Introduction 5
  2.2 Unsupervised learning 6
  2.3 Self-Organizing Feature Map 7
    2.3.1 The Architecture 7
    2.3.2 The Learning Algorithm 9
    2.3.3 The SOM properties useful in data exploration 12
  2.4 Supervised Learning 13
  2.5 Back-propagation 15
  2.6 Previous Work 20
  2.7 Conclusion 22
3 Implementation 23
  3.1 Introduction 23
  3.2 Implementation 24
    3.2.1 Data 25
    3.2.2 Labelling Process 26
    3.2.3 Re-labelling 28
    3.2.4 Neighbouring 28
    3.2.5 Classification 30
    3.2.6 The GCC (Guelph Cluster Class) Algorithm 31
  3.3 Initial Labelled Data Selection 31
  3.4 Experiments 32
  3.5 An Example 33
  3.6 Conclusion 40
4 Experiments 42
  4.1 Introduction 42
  4.2 Data Collection and Description 42
  4.3 Experiments 47
    4.3.1 Experiment #1 48
    4.3.2 Experiment #2 55
    4.3.3 Experiment #3 57
    4.3.4 Experiment #4 61
  4.4 Conclusion 67
5 Analysis and Discussion 68
  5.1 Introduction 68
  5.2 Analysis of Experiment #1 72
    5.2.1 Analysis of Experiment #4 75
  5.3 Analysis of Experiment #2 76
  5.4 Analysis of Experiment #3 77
  5.5 Conclusion 80
6 Conclusions 82
  6.1 Future Work 84
A Experimental Results 85
  A.1 Experiments 1, 2 & 3: Labelling Procedure 85
  A.2 Experiments 1, 2 & 3: Classification stage (Results on test data) 92
  A.3 Experiment 4: Classification stage (Results on test data) 96
List of Tables

3.1 Conservative method 35
3.2 Alliance method 35
4.1 Number of Labelled, Unlabelled, and Test data 49
4.2 Size of Classes in Labelled and Unlabelled data 49
4.3 Networks Parameters 51
4.4 Confusion Matrix for Unlabelled data (E. coli) 52
4.5 Confusion Matrix for Unlabelled data (Mushroom) 53
4.6 Network Parameters for the Alliance method 56
4.7 Alliance and Conservative Results 57
4.8 Size of Labelled and Unlabelled data 64
4.9 Labelled Data Selection Results 65
5.1 Self-labelling accuracy in Alliance and Conservative methods 77
List of Figures

2.1 Unsupervised Learning 6
2.2 Self-Organizing Map 8
2.3 Neighbourhood Topologies 9
2.4 Supervised Learning 14
2.5 Back-propagation Architecture 15
3.1 Labelling Process 28
3.2 Neighbourhood Process 30
3.3 Iris Dataset Distribution 34
3.4 Self-labelling accuracy 37
3.5 Self-labelling ability 37
3.6 BP performance on Test data, IRIS dataset 39
3.7 Testing performance (BP network) on selected IRIS labelled dataset 40
4.1 E. coli data 45
4.2 Breast Cancer data 45
4.3 Mushroom data 46
4.4 Diabetes data 46
4.5 Heart Disease data 47
4.6 GCC ability improving classification of E. coli dataset 54
4.7 GCC ability improving classification of Mushroom dataset 54
4.8 Labelling ability through different techniques on the Breast-Cancer dataset 59
4.9 GCC ability on improving classification of Breast-Cancer dataset 59
4.10 Labelling ability through different techniques on the Diabetes dataset 60
4.11 GCC ability on improving classification of Diabetes dataset 60
4.12 Testing performance (BP network) on selected Heart-Disease labelled data 66
4.13 Testing performance (BP network) on selected Mushroom labelled data 66
5.1 IRIS data ordered map (training data) 69
5.2 A sample labelled data distribution 70
5.3 A sample labelled and unlabelled data distribution 70
5.4 SOM map for 15 IRIS labelled patterns 73
5.5 SOM map for 7 IRIS labelled patterns 73
5.6 Number of mis-classified data versus BP network's performance on the test data 75
5.7 Test on E. coli: relabelling accuracy versus number of relabelling processes 80
A.1 Labelling Accuracy, IRIS dataset 86
A.2 Labelling Ability, IRIS dataset 86
A.3 Labelling Accuracy, E. coli dataset 87
A.4 Labelling Ability, E. coli dataset 87
A.5 Labelling Accuracy, Breast Cancer dataset 88
A.6 Labelling Ability, Breast Cancer dataset 88
A.7 Labelling Accuracy, Mushroom dataset 89
A.8 Labelling Ability, Mushroom dataset 89
A.9 Labelling Accuracy, Diabetes dataset 90
A.10 Labelling Ability, Diabetes dataset 90
A.11 Labelling Accuracy, Heart Disease dataset 91
A.12 Labelling Ability, Heart Disease dataset 91
A.13 BP performance on the IRIS dataset 92
A.14 BP performance on the E. coli dataset 93
A.15 BP performance on the Breast Cancer dataset 93
A.16 BP performance on the Mushroom dataset 94
A.17 BP performance on the Diabetes dataset 94
A.18 BP performance on the Heart Disease dataset 95
A.19 BP performance on selected IRIS labelled data 96
A.20 BP performance on selected E. coli labelled data 96
A.21 BP performance on selected Breast Cancer labelled data 97
A.22 BP performance on selected Mushroom labelled data 97
A.23 BP performance on selected Diabetes labelled data 98
A.24 BP performance on selected Heart Disease labelled data 98
Chapter 1
Introduction
The task of classification occurs in a wide range of everyday human activities. It is applied in any context where a decision or forecast is to be made based on the currently available information, and a classification procedure is a technique for repeatedly making such decisions in new situations. As for its practical definition, classification involves the construction of a procedure that will be applied to a sequence of cases with pre-defined classes, in which each new case must be assigned to one of those classes based on previously seen examples. The construction of a classification procedure has various names, such as pattern recognition, discrimination, and supervised learning. Several approaches have been taken toward this task. The main historical branches of research are: statistics, machine learning, and neural networks.

When provided with enough labelled training samples, a variety of supervised learning algorithms can learn to be reasonably accurate classifiers. Practically, most of the classification procedures need large amounts of data for training, especially when the dimensionality of the input features or the number of classes is large. However, problems arise when there is an insufficient number of labelled training examples available. Creating a sufficient amount of labelled data is tedious and expensive, since they often have to be labelled manually. This lack of data can cause poor estimation of the parameters, which will cause inaccurate generalization on the unseen data. Another problem is unrepresentative training samples, which will raise difficulties in analyzing the data. The training samples from one region of input space for a class might not be a good representative of the samples from the same class in other regions (Shahshahani, 1994).
Similar to other classification techniques, supervised artificial neural networks (e.g. feed-forward networks) may suffer from a lack of training samples. In contrast to labelled data, unlabelled data may be generated more easily. As a result, it would be extremely useful if it were possible to use unlabelled data in the supervised learning procedure. This research and some other studies have integrated unlabelled data into the supervised learning procedure.

Theoretical aspects of the effect of labelled and unlabelled data in classification have been previously examined by several researchers (e.g. (Castelli and Cover, 1995)). In (Blum and Mitchell, 1998), they have used co-training and redundantly sufficient features in their approach. A major assumption in this technique is the existence of a natural separation of the input features into two disjoint sets in each dataset, which may not be found in all datasets. In another approach by (Nigam et al., 1998), a combination of Expectation Maximization and naive Bayes classification is used. In this method, the naive Bayes classifier is used to make an initial classifier using labelled data. Then, EM is applied to assign probabilistically weighted labels to unlabelled data. In (McCallum and Nigam, 1999), a completely unsupervised approach is used for the task of training a text classifier. They use the bootstrapping algorithm, which is a combination of EM and hierarchical shrinkage, in their approach.
This thesis introduces a novel algorithm (Guelph Cluster Class (GCC)) that improves classification using unlabelled data. Instead of training a supervised learning system with a small number of labelled data, GCC applies an unsupervised artificial neural network (Self-Organizing Map (SOM)) to assign labels to unlabelled data and obtain more labelled data. The GCC picks out natural clusters in the input data. These clusters are gathered in such a way that the topology in the output space corresponds to the topology in the input space. Then, the labelled data along with the SOM work as a classifier and assign labels to unlabelled data to provide self-labelled data. Self-labelled data will then be used to reorganize the clusters and to provide more accurate training samples for the supervised system. Ultimately, a supervised learning network, Back-propagation, is used to test the self-labelled data produced by the GCC system.
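The self-labelling step described above can be illustrated with a rough sketch. Everything here is a hypothetical illustration, not the thesis implementation: `cluster_of` stands in for any clustering that maps a pattern to a cluster id (for example, the index of a trained SOM's best-matching unit).

```python
from collections import Counter

def self_label(cluster_of, labelled, unlabelled):
    """Propagate labels cluster by cluster.

    cluster_of: maps a pattern to a cluster id (e.g. a trained SOM's
    best-matching unit index for that pattern).
    labelled: list of (pattern, label) pairs.
    unlabelled: list of patterns.
    Returns self-labelled (pattern, label) pairs for every unlabelled
    pattern that falls in a cluster containing labelled data.
    """
    # Each cluster is labelled by majority vote of its labelled members.
    votes = {}
    for x, y in labelled:
        votes.setdefault(cluster_of(x), Counter())[y] += 1
    self_labelled = []
    for x in unlabelled:
        c = cluster_of(x)
        if c in votes:  # clusters with no labelled members stay unlabelled
            self_labelled.append((x, votes[c].most_common(1)[0][0]))
    return self_labelled
```

The resulting self-labelled pairs would then be added to the labelled set used to train the Back-propagation network.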
In order to test the GCC's performance, several experiments have been performed on six benchmark datasets with different statistical characteristics and various degrees of classification difficulty. The goal of each experiment is to investigate the SOM's ability to produce more accurate self-labelled data. One of the experiments concentrates on training the SOM with different inputs: labelled data (the Conservative method) and a combination of labelled and unlabelled data (the Alliance method). In another experiment, the process of assigning labels to unlabelled data is extended to neighbouring nodes (the Neighbouring method). In addition, a re-labelling method and the degree of mis-classification of the unlabelled data are evaluated. In a completely unsupervised approach, the SOM is used to select the appropriate labelled data (active learning). Selection of labelled data is useful (i) in reducing the amount of labelled data needed for the classification procedure (which reduces the cost of labour), and (ii) in selecting the most effective labelled data.

This research shows how the GCC system can be used to obtain sufficient amounts of self-labelled data with sufficient accuracy to increase the generalization ability of a BP network. In addition, it shows how clusters produced by the SOM can be used to select the initial labelled samples. In conclusion, it can be said that self-labelled data obtained from the GCC approach have a significant effect on the improvement of a supervised learning procedure.
This thesis is partitioned as follows: Chapter 2 is an introduction to the background information in the field of neural networks, especially those techniques that are used in this thesis. In addition, it consists of an overview of the problem and the current approaches used by other research groups. Chapter 3 concentrates on the problem space and provides theoretical and practical information about the proposed approach (GCC). Furthermore, it discusses several experiments that have been carried out to explore variants of the GCC system. Chapter 4 presents the datasets and the results of the experiments. Chapter 5 contains analysis and discussion of the results presented in Chapter 4. Finally, Chapter 6 is a summary and conclusion of the previous chapters and suggested future work for this research.
Chapter 2
Literature Review
2.1 Introduction
Classification has a wide variety of applications in fields such as industry, commerce, and research. The purpose of classification is to find interesting patterns and new knowledge from databases where the dimensionality, complexity, or amount of data is significantly large and manual analysis is impossible.

The field of Artificial Neural Networks is one of the popular research approaches in pattern recognition. One of the important aspects of neural networks is the challenge of reproducing intelligence itself. This results in unique properties for neural network systems, namely generalization over unseen data and overcoming the problem of data complexity. Neural network approaches combine the complexity of some statistical techniques with the objective of simulating human intelligence; however, this is done at a more unconscious level. The majority of the work in this field can be grouped into two learning frameworks: supervised and unsupervised. The following sections contain a brief introduction to the concepts of supervised and unsupervised learning, in addition to the networks used in this thesis.
2.2 Unsupervised learning
The basic notion of unsupervised learning is that no target values are involved in the learning process. The algorithm attempts to learn the structure of the examples without a teacher defining the classes prior to the procedure. No error feedback is provided during the procedure (Figure 2.1), and the system learns to adapt based on the results that have been collected from the previous training patterns and some form of internal distance measure. The results of such a system would be a summary of some properties of the objects in the database.

Figure 2.1: Unsupervised Learning

There are several reasons for being interested in the unsupervised procedure:
* Data Reduction (Clustering): The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. One major motivation for using clustering algorithms is to provide automated tools to construct categories (clusters) and to minimize the effects of humans in the process.

* Dimensionality Reduction (Projection): Projection methods are used to reduce the dimensionality of the data items. The goal of these methods is to represent the input data items in a lower-dimensional space in such a way that important properties of the data set are preserved as much as possible.

* Data Visualization: Unsupervised learning is very useful for the visualization of high-dimensional data items. There are several methods that may be used for this purpose, such as Andrews' curves, Chernoff's faces, and five-number summaries.

* Classification: In many applications of real world data, datasets with targets are scarce. Databases are relatively large and too complicated to be classified by humans. Unsupervised methods may be used to create targets.
One of the major disadvantages of unsupervised learning is its inability to perform the classification task. In order to perform classification, human information is required to transform the output of unsupervised learning systems into classes. Recently, there has been much interest in the use of unsupervised neural computation methods. The Self-Organizing Map is one of the most popular models in this field.
2.3 Self-Organizing Feature Map

The Self-Organizing Map (SOM) was first introduced by Kohonen (Kohonen, 1997). The SOM is modeled after neurobiological structures. The SOM takes advantage of both clustering and projection methods and offers excellent visualization capabilities and techniques to compare input data items. The robust properties of the SOM make it a valuable tool in data mining.
2.3.1 The Architecture
The brain cortex is arranged as a two-dimensional plane of neurons. Each neuron is a cell containing a template that is used to match data patterns. The cells compute distances between their template and the input patterns. Cells with the closest match produce an active output. These distances will be used for representing multi-dimensional data on a two-dimensional plane of neurons. Kohonen also uses a topology similar to the brain. The output layer can be linear for the one-dimensional case, in some form of grid for the two-dimensional case (Figure 2.2), etc.

Figure 2.2: Self-Organizing Map
Kohonen's network consists of an array of nodes connected to each other based on some topology (e.g. rectangular, hexagonal) in one- or two-dimensional space (see Figure 2.3). The interconnections between the nodes only define neighbourhood relations, and no weights are assigned to them. As a result, these connections do not directly influence the learning process, in contrast to other types of neural network models. Each of the nodes forms an output unit by having a weight vector (of the same dimensionality as the input vector) assigned to it. Input and weight vectors ($\vec{x}$ and $\vec{m}_i$) are respectively denoted by

$\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{m}_i = (m_{i1}, m_{i2}, \ldots, m_{in})$

where $n$ is the dimensionality of the input vector.
Figure 2.3: Neighbourhood Topologies (rectangular and hexagonal)
Weights are initially randomly generated without considering neighbouring nodes. During the training process these weight vectors are adapted in such a way that the topology in the input space matches the topology in the output space.
2.3.2 The Learning Algorithm
The learning process begins with competition among the neurons. When an input $\vec{x}$ arrives, the neuron that is best able to represent the input wins the competition and is allowed to learn it even better. Considering the topology of the neighbouring nodes, not only the winning neuron but also its neighbours are allowed to learn. Neighbouring units will gradually adapt, during the training procedure, to respond to similar inputs. When the training is finished, similar inputs are grouped to arrange clusters which can be represented on the map. This is the essence of the SOM algorithm.

The weight vector $\vec{m}_i$ represents the typical input for each neuron $i$. The unit whose weight is nearest to the selected input $\vec{x}$ is the winner. The state of each unit with respect to input $\vec{x}(t)$ is calculated using an activation function based on the Euclidean distance between that input vector and the weight vector $\vec{m}_i(t)$ at time step $t$. Equation 2.3 describes the computation:

$\eta_i(t) = \| \vec{m}_i(t) - \vec{x}(t) \| \qquad (2.3)$

Next, the best matching unit is selected as the winner by using Equation 2.4. The unit with minimal $\eta_i(t)$ would be considered the winner:

$c(t) : \eta_c(t) = \min_i(\eta_i(t)) = \min_i \| \vec{m}_i(t) - \vec{x}(t) \| \qquad (2.4)$

where $c(t)$ is the winning unit at time $t$.
Another popular activation function is based on the multiplication of the input vector $\vec{x}$ and the weight vector $\vec{m}_i$. In this case, the activation function and the winning unit would be evaluated as follows:

$\eta_i(t) = \vec{m}_i(t) \cdot \vec{x}(t), \qquad c(t) : \eta_c(t) = \max_i(\eta_i(t)) \qquad (2.5)$
The winning unit and its neighbours change to represent the input by modifying their weight vectors. The number of units that learn and change their weight vectors depends on the neighbourhood kernel $\varphi_{ci}(t)$, which is a decreasing function of the distance between the winning unit and the other units. Weight changes also depend on a time-varying learning rate $\varepsilon(t)$. The adaptation of weight vector $\vec{m}_i(t)$ results in a new weight vector $\vec{m}_i(t+1)$, which most likely will be selected as the winning unit at a future presentation of the same input:

$\vec{m}_i(t+1) = \vec{m}_i(t) + \varphi_{ci}(t)\,\varepsilon(t)\,[\vec{x}(t) - \vec{m}_i(t)] \qquad (2.6)$

To guarantee that the learning process ends in finite time, the amount of change has to decrease gradually with time. This can be done by selecting a function called the learning rate. It starts with a relatively large number in the range [0, 1] and ends with a value close to 0. An example function might be

$\varepsilon(t) = \varepsilon(0) \cdot \exp\!\left(-\frac{t}{learn}\right)$

where $\varepsilon(0) < 1$ and $learn$ defines a parameter responsible for the reduction over time.
The neighbourhood kernel is used to describe a neighbourhood area around the winning unit $c$. Based on the WTA (winner takes all) characteristic of SOMs, not only the winning unit but also the neighbouring units (depending on the neighbourhood kernel $\varphi_{ci}(t)$) will update their weights. A popular example of a neighbourhood kernel is the Gaussian

$\varphi_{ci}(t) = \exp\!\left(-\frac{\| \vec{r}_c - \vec{r}_i \|^2}{2\,\sigma(t)^2}\right)$

where $\vec{r}_c$ and $\vec{r}_i$ are the map positions of units $c$ and $i$. $\varphi_{ci}(t)$ should decrease to its minimum at the end of the process. $\sigma(t)$ is a time-decreasing function which is defined as

$\sigma(t) = \sigma(0) \cdot \exp\!\left(-\frac{t}{neighbour}\right)$

where $\sigma(0)$ is the initial neighbourhood area, and the $neighbour$ parameter is responsible for the amount of reduction.
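The two decay schedules and the neighbourhood kernel above can be sketched numerically as follows (all parameter values are illustrative, not those used in the thesis; a 1-D map is assumed so map distance is just $|c - i|$):

```python
import math

def learning_rate(t, eps0=0.9, learn=1000.0):
    # epsilon(t) = epsilon(0) * exp(-t / learn)
    return eps0 * math.exp(-t / learn)

def sigma(t, sigma0=5.0, neighbour=1000.0):
    # sigma(t) = sigma(0) * exp(-t / neighbour)
    return sigma0 * math.exp(-t / neighbour)

def kernel(c, i, t):
    # Gaussian neighbourhood kernel over the map distance |c - i|
    s = sigma(t)
    return math.exp(-((c - i) ** 2) / (2.0 * s * s))
```

Early in training a unit a few grid steps from the winner still learns at a substantial fraction of the winner's rate; late in training the kernel is effectively zero there, so only the winner adapts.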
In summary, the algorithm is as follows:

Algorithm 1: The SOM Algorithm
1: for each unit $i$ the weight vector $\vec{m}_i$ is initially set to be random and the neighbourhood $\varphi_{ci}$ to be large.
2: one input vector $\vec{x}$ is selected from all possible inputs.
3: an activation function (Equation 2.3 or 2.5) is used to calculate the state of each unit with respect to the selected input vector.
4: the best matching unit is selected using Equation 2.4.
5: the weight vector of the winner $\vec{m}_c$, as well as the weight vectors of all units in the neighbourhood of the winner, are adapted using Equation 2.6.
6: the neighbourhood kernel is decreased, as well as the learning rate.
7: the next input is presented (Step 2).
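The seven steps above can be sketched end to end in plain Python. This is a minimal illustration only (a 1-D map, Euclidean matching, exponential decays, a Gaussian kernel, and a single decay constant `tau` shared by both schedules); none of the parameter values are taken from the thesis.

```python
import math
import random

def train_som(data, n_units=10, n_steps=500, eps0=0.5,
              sigma0=3.0, tau=200.0, seed=0):
    """Train a 1-D SOM on a list of equal-length tuples/lists."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weight vectors in [0, 1]^dim
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(n_steps):
        x = rng.choice(data)                         # Step 2: pick an input
        # Steps 3-4: squared distances to every unit; winner = smallest
        d = [sum((w[k] - x[k]) ** 2 for k in range(dim)) for w in weights]
        c = d.index(min(d))
        eps = eps0 * math.exp(-t / tau)              # learning-rate decay
        sig = max(sigma0 * math.exp(-t / tau), 0.5)  # neighbourhood decay
        # Step 5: move winner and neighbours toward the input
        for i, w in enumerate(weights):
            h = math.exp(-((i - c) ** 2) / (2.0 * sig * sig))
            for k in range(dim):
                w[k] += eps * h * (x[k] - w[k])
        # Step 6 is the decay above; Step 7 is the next loop iteration.
    return weights

def quantization_error(data, weights):
    """Mean squared distance from each pattern to its best-matching unit."""
    return sum(
        min(sum((w[k] - x[k]) ** 2 for k in range(len(x))) for w in weights)
        for x in data) / len(data)
```

Calling `train_som` with `n_steps=0` returns the untrained random map, which makes it easy to check that training reduces the quantization error.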
2.3.3 The SOM properties useful in data exploration

The learning process of the SOM gives it specific characteristics which are useful in data exploration.

Ordered display: the use of a map as a display for data items is very helpful. Items are mapped to those units that have the closest weight vector, and surrounding neighbourhoods have similar items mapped to them. Such an ordered display of the datasets can ease the understanding of the statistical structures in the datasets.

Visualization of clusters: the same ordered display could be used to demonstrate the clustering density in different parts of the dataset's space. The density of the weight vectors will reflect the density of the input samples.

Missing data: a frequently occurring problem in data exploration is missing components in data vectors. A SOM may handle this problem by using Equations 2.3 or 2.5 on the available elements in the input vector and its relative weight vector.

Outliers: outliers may result in major problems in data analysis. The map generated by the SOM algorithm may be used to detect and discard outliers from the dataset. In addition, an outlier will affect just one unit and its neighbourhood, not the rest of the training samples.

The major drawback to the use of the SOM network is its inability to perform the classification task. The output of the SOM network must be manually labelled, which is an extra cost.
2.4 Supervised Learning
In every analytical system there exist some patterns whose desired responses are known. The patterns and their desired responses are called inputs and targets, respectively. The target may be a class, in which case the task is called classification, or a continuous signal, in which case the task is called regression. The goal of supervised learning is to learn a model or mapping that will correctly relate the inputs and their targets. To achieve this goal, a teacher will help the system's learning procedure by defining the correct labels and providing the final error for the system. The final error will then be used to optimize the learning parameters (Figure 2.4).

Figure 2.4: Supervised Learning

The biggest advantage of supervised learning is its ability to generate correct outputs for input data patterns that are not part of the training set. The other properties of supervised learning are robustness to noise and the capability of handling missing elements in data patterns.

There are a few disadvantages to the use of supervised learning. Supervised learning methods are not immune from sensitivity to badly chosen initial data and parameters in the method, as well as slow learning speed. They need a large amount of data for training. In addition, providing large amounts of labelled data for the learning method is costly and sometimes impossible.

Depending on the information that the teacher carries, there are two approaches to supervised learning. One is based just on the fact that the decision is correct or wrong (reinforcement learning), and the other is based on the optimization of a training cost function where the least square error approximation plays a major role. The following section covers a brief introduction to the Back-propagation network, the supervised learning system used in this thesis.
2.5 Back-propagation
The Back-propagation network, a Multilayer Perceptron, is the most popular supervised
neural network based on the error-correcting method. Back-propagation has successfully
been used on many different problems. Given enough training data and an appropriate
architecture and initial conditions, it has been shown that BP is capable of learning the
mapping of any function to satisfactory accuracy (Haykin, 1999).
The network consists of a set of units arranged logically into input, hidden and output
layers, with no connections inside a layer. There may be more than one hidden layer.
The input of each layer is the output of the previous layer. The connections
carry weights which summarize the network's behaviour and are adjusted during training
(Figure 2.5).
Figure 2.5: Back-propagation Architecture
The operation of the network consists of two stages through the different layers, the forward
pass and the backward pass (back-propagation). In the forward pass an input pattern vector
$\bar{x}$, denoted by

    \bar{x} = (x_1, x_2, \ldots, x_n)

is presented to the network. As the input passes through the network, the activation input
to the next layer is the sum of the products of the incoming vector with its respective
weights. The general formula for the activation input to a node j is

    a_j = \sum_i w_{ji} \, out_i    (2.12)

where w_{ji} is the weight connecting node i to node j and out_i is the output from node i. For
example, for the first hidden layer this equation is

    h_j = \sum_{i=1}^{n} w_{ji} x_i

where h_j is the activation input to hidden layer unit j and n is the number of input
nodes. The output of a node from each layer j is calculated based on its activation input,

    out_j = f(a_j)    (2.14)

where f denotes the activation function of each node. A frequently used activation function
is the logistic sigmoid,

    f(a) = \frac{1}{1 + e^{-a}}
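The forward pass can be sketched in a few lines of Python. The layer sizes and weight values below are illustrative assumptions, not values from the thesis.

```python
import math

def sigmoid(a):
    # Logistic sigmoid activation: f(a) = 1 / (1 + e^(-a)).
    return 1.0 / (1.0 + math.exp(-a))

def forward_layer(inputs, weights):
    # weights[j][i] connects input node i to node j: each node takes the
    # weighted sum of its inputs and passes it through f (Eq. 2.12, 2.14).
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)))
            for row in weights]

# Tiny illustrative network: 2 inputs -> 2 hidden units -> 1 output.
x = [1.0, 0.5]
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_output = [[0.5, -0.5]]
hidden = forward_layer(x, w_hidden)
output = forward_layer(hidden, w_output)
```

Stacking `forward_layer` calls in this way is exactly how the input of each layer becomes the output of the previous one.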
Since the back-propagation network is trained based on supervised learning, each input
vector has a desired output vector which represents the classification for the input pattern.
During the backward pass, weights are adjusted according to the error (the difference between
the desired output and the actual output of the system). The weights between the output layer
and the layer below are updated by the generalized delta rule

    w_{kj}(t+1) = w_{kj}(t) + \epsilon \, \delta_k \, out_j    (2.15)

where w_{kj}(t+1) are the updated weights at time step (iteration) t+1, and \epsilon is a learning
rate parameter. The term \delta is calculated differently depending on the layer. For the
output layer nodes, \delta is calculated with respect to the rate of change of the error and the input
to node k,

    \delta_k = (d_k - out_k) \, f'(a_k)    (2.16)

where d_k is the desired output for node k. In this stage the weights connecting the output
layer and the layer below (hidden) are updated. Weights for the hidden layer, j, and below are
updated using Equation 2.15. The \delta for these layers is calculated with respect to \delta_k in the
output layer,

    \delta_j = f'(a_j) \sum_k \delta_k w_{kj}    (2.17)

The back-propagation algorithm is a gradient descent optimization procedure which mini-
mizes the mean square error between the network output and the desired output over all the
input patterns.
The algorithm is summarized as follows:
Algorithm 2 The Back-propagation Algorithm
1: Initialization: assign random values in [-1, 1] to the weights in all the layers.
2: Presentation of training data: train the network by randomly selecting an input
pattern from the training set.
3: Forward pass: compute the function signals of the network by proceeding forward,
layer by layer. Calculate the activation input to each layer using Equation 2.12. The
output signal of each unit is calculated using Equation 2.14.
4: Backward pass: compute the \delta s of the network using Equations 2.16 and 2.17. Adjust
the weights of the network using Equation 2.15.
5: Iteration: apply stages 2, 3 and 4 to the training samples until the stopping criterion
is met.
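Algorithm 2 can be sketched as a small, self-contained Python program. The one-hidden-layer architecture, the XOR toy data, the learning rate, the bias inputs and the fixed iteration budget are illustrative assumptions; the thesis's own experiments use different datasets and parameters.

```python
import math
import random

def sigmoid(a):
    # Logistic sigmoid, a frequently used activation function.
    return 1.0 / (1.0 + math.exp(-a))

class BackProp:
    # One-hidden-layer back-propagation network (a sketch of Algorithm 2).
    # Each node has an extra bias weight fed by a constant 1 input.

    def __init__(self, n_in, n_hidden, n_out, lr=0.5, seed=0):
        rng = random.Random(seed)
        # Step 1: assign random values in [-1, 1] to all the weights.
        self.w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]
        self.lr = lr

    def forward(self, x):
        # Step 3: forward pass, layer by layer (Equations 2.12 and 2.14).
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_h]
        hb = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_o]
        return xb, hb, o

    def train_pattern(self, x, target):
        # Step 4: backward pass with the generalized delta rule
        # (Equations 2.15-2.17); for the sigmoid, f'(a) = out * (1 - out).
        xb, hb, o = self.forward(x)
        d_o = [(t - ok) * ok * (1 - ok) for t, ok in zip(target, o)]
        d_h = [hb[j] * (1 - hb[j]) *
               sum(dk * self.w_o[k][j] for k, dk in enumerate(d_o))
               for j in range(len(hb) - 1)]
        for k, dk in enumerate(d_o):
            for j, hj in enumerate(hb):
                self.w_o[k][j] += self.lr * dk * hj
        for j, dj in enumerate(d_h):
            for i, xi in enumerate(xb):
                self.w_h[j][i] += self.lr * dj * xi

def mean_square_error(net, data):
    # The quantity the gradient descent minimizes, averaged over patterns.
    return sum((t[0] - net.forward(x)[2][0]) ** 2 for x, t in data) / len(data)

# Steps 2 and 5: present randomly chosen patterns until the stopping
# criterion (here simplified to a fixed iteration budget) is met.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
net = BackProp(2, 4, 1, lr=0.5, seed=1)
rng = random.Random(2)
err_before = mean_square_error(net, data)
for _ in range(20000):
    x, t = rng.choice(data)
    net.train_pattern(x, t)
err_after = mean_square_error(net, data)
```

A tolerance on the mean square error would be the more faithful stopping criterion for step 5; the fixed budget keeps the sketch short.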
Once the network has been trained, the weights are saved to be used in the classifi-
cation of unseen data (test data). The capability of processing unseen instances is called
generalization. When the generalization performance of the network on test data is much
worse than its performance on the training data, the problem is called overfitting. Over-
fitting is sometimes due to the fact that the training material does not sufficiently cover
the class space. A second reason may be a high degree of non-linearity in the training
data. In both cases, it is clear that back-propagation (similar to other supervised learning
methods) is significantly sensitive to the training dataset itself and its distribution.
Overfitting is a very well known problem. There have been several attempts to
avoid it, either by (i) changing the training data, or (ii) changing the size of the
network. (Muller et al., 1996) performed a detailed study on the generalization of multilayer
feed-forward networks. They use a function based on the number of training samples
and network parameters, testing a higher-order universal asymptotic scaling law
on the training examples to obtain a general theory relating the training curve and the number of
samples. In Equation 2.19, e_g is the generalization ability, m the number of parameters of
the network and n the number of training samples. For multilayer feed-forward networks of up
to 256 weights, they demonstrated strong overfitting for a small number of training samples
n; in this case, the generalization error was estimated as 1/n. As the number of samples
increases, the bend of the learning curve becomes close to 1/n^2.
(Baum and Haussler, 1989) discussed the same problem. Their result can be applied to
all multilayer feed-forward learning algorithms. They addressed the questions of when
a network may be expected to generalize and what the range of training samples should be,
based on the number of units and weights. The results of this study showed that the lower
and upper bounds for the number of random training samples m are of order

    m = \Omega\!\left(\frac{W}{\epsilon}\right)    (2.20)

and

    m = O\!\left(\frac{W}{\epsilon} \log_2 \frac{N}{\epsilon}\right)    (2.21)

where W is the number of weights, N is the number of nodes and 0 \le \epsilon \le 1/8. It is shown that
if m is higher than Equation 2.21, then at least a proportion 1 - \epsilon of the test examples
will be correctly classified. On the other hand, if it is lower than Equation 2.20, the network
significantly fails to classify a proportion 1 - \epsilon of the test samples.
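Treating the two bounds in their order-of-magnitude (O / Omega) form, a rough sample-size calculator can be sketched as follows; the constants are dropped, so the numbers are guides rather than exact thresholds, and the function name is ours.

```python
import math

def sample_size_bounds(num_weights, num_nodes, eps):
    # Order-of-magnitude sample-size bounds in the style of Baum and
    # Haussler (1989): Omega(W/eps) samples needed, and on the order of
    # (W/eps) * log2(N/eps) sufficient.  Constants are dropped.
    if not 0.0 < eps <= 1.0 / 8.0:
        raise ValueError("epsilon must lie in (0, 1/8]")
    lower = num_weights / eps
    upper = (num_weights / eps) * math.log2(num_nodes / eps)
    return lower, upper

# e.g. a network with 256 weights and 20 nodes, 10% error tolerance:
lo, hi = sample_size_bounds(num_weights=256, num_nodes=20, eps=0.1)
```

Even at this coarse level, the calculator makes the thesis's point concrete: thousands of labelled samples may be needed for a modest network, which motivates exploiting unlabelled data.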
2.6 Previous Work
Creating sufficient labelled training examples to learn reasonably accurate classifiers is time
consuming and expensive, since they typically must be labelled manually. This problem has
led several researchers to consider learning algorithms that do not require a large amount
of labelled data. (Castelli and Cover, 1995) provide theoretical proof that unlabelled data
may be used to improve classification. In addition, they discuss the value of labelled
data and its influence on classification error. The use of unlabelled data can be useful in
reducing the cost of the classification procedure since, unlike labelled data, unlabelled data is
easy to obtain and plentiful.
In (Blum and Mitchell, 1998), an algorithm as well as several experiments are introduced
to demonstrate how unlabelled data may improve supervised learning. This approach is
called co-training and is applicable under the following assumptions for each dataset:

each dataset is redundantly sufficient for classification

features in each dataset are separated into two disjoint sets

The key idea behind the algorithm is that it uses two independent classifiers instead
of one. Each classifier uses a different characteristic of the dataset to do the classification.
Both classifiers are trained using the labelled data. This results in two incomplete classifiers.
Then, each classifier examines the unlabelled data to pick the most confident positive and negative
examples and adds them to the labelled pool. These predictions are combined to decrease
classification error. Co-training has been tested on webpage datasets and has shown accuracy
improvements of up to 74.3%. Co-training may be a powerful method, but it is not
always applicable. A large majority of datasets have just a single feature set and, if they
have more than one, those feature sets are not independent. In (Nigam and Ghani, 2000),
the authors refer to this problem and compare results of co-training with the Expectation
Maximization (EM) algorithm. Co-training and EM performances are very close, and
both are applicable under certain assumptions that may not be met in all datasets
(e.g. EM is based on an assumption of word independence which might be violated by
text data). These deficiencies led the authors to construct an algorithm called
co-EM, a combination of co-training and EM, which results in lower errors.
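The co-training loop can be sketched as below. The two "views" correspond to the two disjoint, redundantly sufficient feature sets; the nearest-centroid base learner and the toy data are illustrative stand-ins for the naive Bayes classifiers and webpage data used by Blum and Mitchell.

```python
def centroid_classifier(examples):
    # A deliberately simple base learner: one centroid per class; it
    # stands in for the naive Bayes classifiers of the original paper.
    sums, counts = {}, {}
    for x, y in examples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums.get(y, [0.0] * len(x)), x)]
    cents = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        # Returns (label, confidence); confidence is negated distance.
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        label = min(cents, key=lambda y: dist(cents[y]))
        return label, -dist(cents[label])
    return predict

def co_train(labelled, unlabelled, rounds=5, per_round=1):
    # Each example is ((view1, view2), label): the two views are the
    # disjoint, redundantly sufficient feature sets co-training assumes.
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        h1 = centroid_classifier([(v1, y) for (v1, _), y in labelled])
        h2 = centroid_classifier([(v2, y) for (_, v2), y in labelled])
        # Each classifier labels the pool examples it is most confident
        # about and adds them to the shared labelled set.
        for h, view in ((h1, 0), (h2, 1)):
            scored = sorted(pool, key=lambda u: h(u[view])[1], reverse=True)
            for u in scored[:per_round]:
                labelled.append((u, h(u[view])[0]))
                pool.remove(u)
            if not pool:
                break
    return labelled

labelled = [(((0.0,), (0.0,)), "a"), (((1.0,), (1.0,)), "b")]
unlabelled = [((0.1,), (0.1,)), ((0.9,), (0.9,)),
              ((0.2,), (0.15,)), ((0.8,), (0.85,))]
result = co_train(labelled, unlabelled, rounds=4, per_round=1)
```

The shared pool is the essential design choice: each view's classifier teaches the other by growing the common labelled set.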
In (Virginia, 1994), a completely unsupervised approach was applied to the Peterson-
Barney vowel dataset. It is shown that an appropriate classifier may be
learned without having any signal or labels. The algorithm is called self-supervised. In this
algorithm, labels are assigned to codebook vectors using the k-nearest neighbour algorithm,
after the vectors have been randomly chosen. Their labels are used as the labels of the data
examples. These weights are updated throughout the process. This algorithm is applicable
to tasks in which signals for two or more modalities are available.
A different approach to the use of unlabelled data has been introduced by Dale Schuur-
mans (Schuurmans, 1997), on optimization of standard model selection (a mechanism
to balance hypothesis complexity against data fit). This approach takes advantage
of the distribution of unlabelled data in order to investigate whether the distance
between any two chosen sequences of hypotheses is violated (far from the true distance). A
suitable distance under a predefined distribution of unlabelled data is used to estimate
the true distance. This method has been tested on polynomial curve-fitting, where it shows
significant improvement compared to previous approaches.
EM is a popular technique and has been used in several studies that combine labelled
and unlabelled data. Since the naive Bayes classifier suffers from high variance when labelled
data is insufficient, in (Nigam et al., 1998) a combination of EM and this classifier is used to
overcome the problem. In this method, the naive Bayes classifier is used to build an initial
classifier from labelled data. Then, EM is applied to assign probabilistically weighted
labels to unlabelled data. EM finds a local maximum likelihood parameterization using
both labelled and unlabelled data. The experimental results of this method on real-world
datasets such as WebKB, News Groups, and ModApte demonstrate up to 33% improvement
in classification error.
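The labelled-plus-unlabelled EM scheme can be sketched with a one-dimensional Gaussian class model standing in for the naive Bayes text model; everything below (model, data, iteration count, variance floor) is an illustrative assumption.

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_with_unlabelled(labelled, unlabelled, n_iter=20):
    # Sketch of the labelled+unlabelled EM scheme: build an initial
    # classifier from the labelled data alone, then let EM refine it
    # with probabilistically weighted labels for the unlabelled points.
    classes = sorted({y for _, y in labelled})
    params = {}
    for c in classes:  # initial classifier (M-step on the hard labels)
        xs = [x for x, y in labelled if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-3
        params[c] = (mu, var, len(xs) / len(labelled))
    for _ in range(n_iter):
        # E-step: probabilistically weighted labels for unlabelled data.
        resp = []
        for x in unlabelled:
            p = {c: pr * gauss_pdf(x, mu, var)
                 for c, (mu, var, pr) in params.items()}
            z = sum(p.values()) or 1.0
            resp.append({c: p[c] / z for c in classes})
        # M-step: maximum-likelihood parameters from labelled data
        # (weight 1 on the known class) plus softly weighted unlabelled data.
        xs = [x for x, _ in labelled] + list(unlabelled)
        new = {}
        for c in classes:
            w = [1.0 if y == c else 0.0 for _, y in labelled]
            w += [r[c] for r in resp]
            tot = sum(w)
            mu = sum(wi * xi for wi, xi in zip(w, xs)) / tot
            var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, xs)) / tot
            new[c] = (mu, var + 1e-3, tot / len(xs))
        params = new
    return params

def classify(x, params):
    return max(params, key=lambda c: params[c][2] * gauss_pdf(x, *params[c][:2]))

labelled = [(0.0, "a"), (0.2, "a"), (3.0, "b"), (3.2, "b")]
unlabelled = [0.1, -0.1, 3.1, 2.9, 0.05, 3.05]
params = em_with_unlabelled(labelled, unlabelled)
```

Keeping weight 1 on the known labels while the unlabelled points carry soft weights is the essential feature of the scheme: labelled evidence is never overridden, only supplemented.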
In (McCallum and Nigam, 1999), a completely unsupervised approach is used for the task
of learning a text classifier. This technique is applicable to text datasets when there are a few
keywords per class and a class hierarchy is available. They use a bootstrapping algorithm
which is a combination of EM and hierarchical shrinkage. In their approach, keywords
are used to generate preliminary labels by term matching. Then, EM uses these labels
along with the keywords and class hierarchies to reassign probabilistically weighted class labels to
unlabelled data. Classification is further improved by using shrinkage (a statistical model
for improving parameter estimation for sparse data). Experimental results on web computer
science topics show an accuracy close to human performance.
In (Shahshahani, 1994), EM and a mixture of Gaussians are used to investigate the effect
of unlabelled data on improving feature extraction, classification and class statistics in a
remote sensing application. EM is used to find Maximum Likelihood (ML) estimates on both labelled
and unlabelled data. The starting points for the ML estimates are obtained from the training examples.
2.7 Conclusion
In order to identify the problem addressed in this thesis and its proposed solution, this
chapter provided an overview of the necessary background information: supervised
and unsupervised learning, SOM and BP networks, existing limitations in the supervised
learning procedure, and previous studies on the problem. In the following chapter, a novel
algorithm called Guelph Cluster Class is introduced. The problem (concerning the clas-
sification procedure) and its proposed solution will be discussed in detail. Furthermore, an
overview of the experiments as well as their objectives will be provided. A brief introduction
to the experimental procedures is given through an example using the IRIS dataset. The
rest of the datasets, experiments and their results are discussed in detail in Chapter 4.
Chapter 3
Implementation
3.1 Introduction
In the previous chapter, supervised and unsupervised learning were briefly introduced, as
well as their advantages and disadvantages, and background information on BP and SOM was
provided. In this chapter, the purpose of this study is discussed in more detail. The
proposed algorithm (GCC) is discussed theoretically, and an overview of the experiments
and their objectives is presented, along with an example to clarify the experiments.
An important problem in supervised learning is the effect of insufficient training samples
on classification performance. This problem is common to most classification methods and
is one of the reasons that these methods are costly or sometimes not applicable. In
practice, often only a limited number of labelled training samples can be obtained, since
they typically must be labelled manually.
Usually both the classification and feature extraction stages of an analysis are based on
the optimization of parameters that must be estimated using training samples. If the number
of labelled samples is small, both of these stages may suffer from high variance in the
parameter estimates, and the result of the whole analysis may not be satisfactory. Another
problem caused by small sample size is unrepresentative labelled data: training samples
from neighbouring regions of the data space may not be a good representation of the samples
of the same class in other regions.
The purpose of this thesis is to examine a technique which reduces the problems caused
by insufficient training data. The use of unlabelled data is a reasonable choice since it is
easier to obtain. It has been proven (in both theoretical and practical respects) that, under
certain conditions, unlabelled data carry useful information about the underlying function
((Castelli and Cover, 1995) and (Shahshahani, 1994)). The use of unlabelled data in the design
of classifiers could be useful:

to reduce the variance of the parameters, which results in better estimates

to obtain statistics that are more representative of the true distribution of the samples

to obtain prior knowledge about the distribution and statistics of the dataset, which
can further be used in the classification process.
3.2 Implementation
This thesis is an examination of the use of unlabelled data in categorization problems, as
well as of how well this technique performs on different real-world datasets.
The specific approach described in this thesis is based on a combination of two
well-known learning algorithms: the Self Organizing Map (SOM) and Back-Propagation
(BP). Since this approach deals with unlabelled data, an unsupervised learning network is
necessary to carry out the clustering task. The SOM has been selected due to its unique
properties. The SOM is capable of performing both data and dimensionality reduction at the same
time, without using prior information concerning the data distribution. In the SOM, data is ordered
into units of a map in such a way that, in this ordered map, similar data lie close to
each other. The resulting map can be used for visualization of the data and provides
information about the statistics and distribution of the data. In this approach, the SOM is
trained using a finite number of labelled data to make an initial classifier. Subsequently, the
resultant ordered map along with the initial classifier are used to assign labels to unlabelled
data, so as to provide self-labelled data. The resulting self-labelled data are then used to
reformulate the clusters and to provide training data for the classification task. BP is used
for the classification procedure and for testing the self-labelled data.
There are several assumptions underlying this technique: i) input data naturally falls
into clusters instead of being distributed across the entire data space, ii) all data points
in these clusters correspond to a specific class, iii) there is a one-to-one correspondence
between clusters and classes, and iv) labelled data and unlabelled data are from the same
distribution. It is necessary to point out that the above assumptions are common to most
approaches.
To ground the theoretical and practical aspects of this technique, and to provide a
background for the algorithm, it is necessary to define some notation.
3.2.1 Data
Consider a dataset X which represents the input data for the problem. The elements of X
are assumed to be vectors of real numbers with dimensionality n. X is denoted by

    X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\}

Since this is a classification problem, each data vector is associated with a class
label. Consider L to be a finite set of labels that are assigned to the training samples
according to the function

    l : X \rightarrow L

where l is a real-world function and the goal of the system is to learn it to a satisfactory
degree.
The dataset X is partitioned into two disjoint subsets: Y, the labelled data, and Y', the unlabelled
data. For all \bar{x} \in Y, l(\bar{x}) is considered to be known, and for all \bar{x} \in Y', unknown. The input of
the GCC system can be:

just the labelled data Y; in this case the method is termed Conservative

the union of labelled and unlabelled data, Y \cup Y'; this method is called Alliance.
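The partition of X into Y and Y', and the two input modes, can be sketched as follows; the function names and the toy dataset are illustrative.

```python
import random

def partition(dataset, n_labelled, seed=0):
    # Split a fully labelled dataset X into the labelled subset Y and an
    # "unlabelled" subset Y' whose labels are set aside, as in the
    # experiments.  Names and sizes here are illustrative.
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    Y = data[:n_labelled]                        # (vector, label) pairs
    Y_prime = [x for x, _ in data[n_labelled:]]  # labels discarded
    return Y, Y_prime

def gcc_input(Y, Y_prime, method="Alliance"):
    # Conservative: only the labelled vectors are shown to the SOM.
    # Alliance: the union of labelled and unlabelled vectors is used.
    vectors = [x for x, _ in Y]
    if method == "Alliance":
        vectors += list(Y_prime)
    return vectors

X = [((i / 10.0,), "a" if i < 5 else "b") for i in range(10)]
Y, Y_prime = partition(X, n_labelled=4, seed=1)
```

Note that only the vectors, never the set-aside labels, reach the SOM in either mode.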
3.2.2 Labelling Process
As already noted, the SOM is used to assign labels to elements of the unlabelled subset Y'.
The term labelling process refers to the process of assigning labels to SOM nodes, which
results in assigning labels to unlabelled vectors as well. This process is done as follows. The
SOM consists of a set of nodes S arranged in a two-dimensional space

    S = \{ s = (i, j) \}

The SOM is trained using an input dataset (from either the Alliance or the Conservative method).
During the training process, the SOM orders the map in such a way that the topology
in the output space corresponds to the topology in the input space. The training process
is completed by mapping input vectors to the nodes. This process is represented by the
clustering function f_c, denoted by

    f_c : X \rightarrow S

If f_c^{-1} is the inverse of the clustering function f_c, then f_c^{-1}(s) is the set of
input vectors associated with a particular SOM node s.
If the node to which an unlabelled vector has been clustered is assigned a label, then
the unlabelled vector may be assigned the same label under certain conditions. Unlabelled
data can be clustered to a node s where:

all the labelled vectors that have been assigned to that node have the same label, l_s.
In this case, the SOM node and all the previously seen unlabelled vectors in node s
will be assigned the label l_s (Figure 3.1), where l^{(1)} denotes the resulting labelling
function for \bar{x} \in Y'. This process is referred to as first-order labelling.

all the labelled vectors assigned to node s do not have the same label. In that
case, the label for that node cannot clearly be identified. This node is referred to
as a non-labelling node (Figure 3.1) and the labels for the unlabelled vectors clustered
to that node cannot be identified.

no labelled vectors have been clustered to node s. The node is then referred to as
an undefined node (Figure 3.1) and the label of the unlabelled vectors will remain
unknown.

The term ambiguous refers to nodes that are either non-labelling or undefined.
In summary, Y'^{(1)}, the subset of Y' elements that have been clustered to non-ambiguous
nodes, can be represented by

    Y'^{(1)} = \{ \bar{x} \in Y' : f_c(\bar{x}) \text{ is a labelling node} \}
Figure 3.1: Labelling Process
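First-order labelling can be sketched as below, with a simple stand-in for the trained SOM's clustering function f_c; the 1-D node map and the data are illustrative assumptions.

```python
def first_order_labelling(node_of, labelled, unlabelled):
    # Sketch of the labelling process.  node_of(x) plays the role of the
    # SOM clustering function f_c, mapping a vector to its best node.
    seen = {}
    for x, y in labelled:
        seen.setdefault(node_of(x), set()).add(y)
    node_labels = {}
    for s, ys in seen.items():
        # Labelling node if all its labelled vectors agree; otherwise a
        # non-labelling (ambiguous) node, marked None.
        node_labels[s] = next(iter(ys)) if len(ys) == 1 else None
    self_labelled, remaining = [], []
    for x in unlabelled:
        s = node_of(x)
        # Undefined nodes (absent from node_labels) and non-labelling
        # nodes both leave the vector unlabelled for the next stage.
        if node_labels.get(s) is not None:
            self_labelled.append((x, node_labels[s]))
        else:
            remaining.append(x)
    return node_labels, self_labelled, remaining

def node_of(x):
    # Toy 1-D stand-in for f_c: nodes indexed by rounding.
    return round(x)

labelled = [(0.1, "a"), (0.2, "a"), (2.1, "b"), (2.9, "b"), (3.1, "a")]
node_labels, self_labelled, remaining = first_order_labelling(
    node_of, labelled, [0.3, 1.6, 2.2, 4.4])
```

Here node 3 receives both "a" and "b" labelled vectors, so it becomes non-labelling, while node 4 is never seen and stays undefined; both kinds pass their vectors on unlabelled.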
Second-order labelling refers to a process in which the SOM is retrained with the original
labelled data in addition to the new self-labelled data. These self-labelled vectors can be
used to reorganize the SOM clusters in the same way the original labelled data were used.
This process can be extended to an arbitrary depth. However, the re-labelling process
is typically terminated at the point where no major change can be seen in the amount of
remaining unlabelled data (the point of diminishing returns).
3.2.4 Neighbouring
As was previously noted, the SOM has the unique characteristic of assigning similar train-
ing patterns to neighbouring nodes. The shape of the neighbourhood depends on the SOM
topology (e.g. rectangular, hexagonal). This results in a special relationship between neigh-
bouring nodes that can be used to extend the ability of the SOM to provide more labels for
unlabelled vectors. These labels are assigned to nodes that have previously been
considered undefined. Consider S' \subset S, where the elements of S' are assumed to be the neighbours
of the node s = (i, j).
The possibility of assigning labels to undefined nodes can be investigated by examining the
neighbouring nodes. An undefined node s could be located where:

all the neighbouring nodes which have a label assigned to them have the same
label, l_s. Then the undefined node is assigned that label (Figure 3.2). As a
result, any unlabelled vectors clustered to this node will be assigned the same
label:

    \forall \bar{x} \in Y \;\wedge\; \forall s' \in S' : f_c(\bar{x}) = s' \Rightarrow l(\bar{x}) = l_s    (3.8)

multiple labels are assigned to the neighbouring nodes (Figure 3.2). No label will be
assigned to the node (it remains ambiguous).

all the neighbouring nodes are undefined. In this case, no label is assigned to the node.

The neighbouring procedure can be repeated iteratively until there are no undefined
nodes in the neighbourhood of the labelled and newly added self-labelled data.
Figure 3.2: Neighbourhood Process
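The neighbouring procedure on a rectangular map can be sketched as follows; the 3x3 toy map and the 8-connected neighbourhood are illustrative choices.

```python
def neighbours(s, width, height):
    # Rectangular (8-connected) neighbourhood of node s = (i, j).
    i, j = s
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < width and 0 <= j + dj < height]

def neighbouring(node_labels, width, height):
    # Sketch of the neighbouring procedure: an undefined node whose
    # labelled neighbours all carry one label inherits it; nodes with
    # conflicting or no labelled neighbours are left alone.  Passes
    # repeat until nothing changes.  node_labels maps already-decided
    # nodes to a label, or to None for non-labelling nodes.
    labels = dict(node_labels)
    changed = True
    while changed:
        changed = False
        for i in range(width):
            for j in range(height):
                if (i, j) in labels:
                    continue  # already labelled or non-labelling
                near = {labels[s] for s in neighbours((i, j), width, height)
                        if labels.get(s) is not None}
                if len(near) == 1:
                    labels[(i, j)] = near.pop()
                    changed = True
    return labels

# All nodes of a 3x3 map are undefined except two labelled corners.
filled = neighbouring({(0, 0): "a", (2, 2): "a"}, 3, 3)
```

With agreeing corners the whole map is eventually labelled; with conflicting corners, the nodes that see both labels stay ambiguous, which is exactly the second case in the list above.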
3.2.5 Classification
Similar to any other classification technique, BP suffers from poor estimation and unrep-
resentative training samples. These problems have been addressed in (Baum and Haussler,
1989) and (Muller et al., 1996). The objective of the studies presented in this thesis is to
overcome the difficulties in the supervised learning procedure (BP as an example) caused
by insufficient training samples. Several methods of deriving labelled data from unlabelled
data have been introduced in the previous sections. After labelling unlabelled data
to a desired degree, d, the validity of the approaches is tested by evaluating the
performance of BP trained on the original labelled patterns, as well as on the newly added
self-labelled data. BP is trained on input vectors which are a subset of X. The elements of this
subset are all labelled; in particular, the training set is Y'^{(d)} \cup Y (where d is the desired degree of
re-labelling and Y'^{(d)} is the self-labelled data), instead of simply Y as in the conventional
approach.
3.2.6 The GCC (Guelph Cluster Class) Algorithm
This approach is termed "Guelph Cluster Class (GCC)", a name originally coined by
Stacey, Kremer, and Dara. Having provided the required background information in the previous sections,
the GCC algorithm is summarized as follows:
Algorithm 3 The GCC Algorithm
1: train the SOM using the Conservative or Alliance method
2: apply first-order labelling to obtain self-labelled data
3: add the self-labelled data to the original labelled data
4: apply second-order labelling and the neighbouring method to obtain more self-labelled data
5: repeat step 4 until no major change can be seen in the remaining unlabelled data
6: train and test the supervised learning network with the original labelled data and the self-
labelled data
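The labelling loop of Algorithm 3 (steps 2-5) can be sketched as below. The SOM training of step 1, the SOM retraining implied by second-order labelling, and the BP stage of step 6 are abstracted away behind a fixed clustering function, so this is a structural sketch only; all names and data are illustrative.

```python
def gcc(cluster, labelled, unlabelled, max_rounds=10):
    # High-level sketch of the GCC labelling loop (steps 2-5 of
    # Algorithm 3).  'cluster' stands in for the trained SOM: it maps a
    # vector to a node identifier.
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        # A node is a labelling node if all labelled vectors clustered
        # to it agree on a single label (first-order labelling).
        by_node = {}
        for x, y in labelled:
            by_node.setdefault(cluster(x), set()).add(y)
        node_label = {s: next(iter(ys))
                      for s, ys in by_node.items() if len(ys) == 1}
        newly, rest = [], []
        for x in pool:
            (newly if cluster(x) in node_label else rest).append(x)
        if not newly:  # point of diminishing returns: stop (step 5)
            break
        # Step 3: add the self-labelled data to the labelled set.
        labelled += [(x, node_label[cluster(x)]) for x in newly]
        pool = rest
    return labelled, pool

def cluster(x):
    # Toy stand-in for the SOM clustering function f_c.
    return round(x)

Y = [(0.1, "a"), (3.1, "b")]    # labelled
Yp = [0.2, 0.9, 1.2, 2.8, 3.3]  # unlabelled
final_labelled, still_unlabelled = gcc(cluster, Y, Yp)
```

Because the stand-in clustering function never changes, later rounds add nothing here; in the full system the SOM is retrained with the grown labelled set, which is what makes the later rounds productive.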
3.3 Initial Labelled Data Selection
Most studies that apply unlabelled data to improve supervised learning use labelled
data in their approaches. Labelling is a costly process, and it is very important to find a
method to select the smallest and most effective set of data patterns as labelled data (active
learning). Unlabelled data may provide useful prior knowledge about the distribution and
statistics of the data, which can be used to select labelled patterns. In this study, the
information from the clustering stage of the SOM is used to select initial labelled data with
the highest quality and lowest quantity. When training of the SOM network is complete,
the nodes that had unlabelled patterns assigned to them can be used for data selection by one
of the following strategies:

Method 1: select one or more data patterns from every node.

Method 2: vary the number of selected items with the density of data patterns in each
region. If the density in a region is high, a smaller number of patterns is selected
from that region; on the other hand, larger numbers of data patterns are selected
from low-density regions.

Method 3: select one or more patterns from randomly selected nodes. In this case, the number of
selected data patterns must be specified. All the neighbours around a selected node
are blocked, to make sure no more data are selected from that neighbourhood.

Each strategy has advantages and disadvantages that are important to consider during
the experiments. The labelled data resulting from Method 1 is sufficient for the classification pro-
cedure without using the GCC system. In this method, patterns are selected from all the
nodes in the input space. The resulting labelled set has the same characteristics as
the original dataset, and classification performance may be at its highest level. A major
drawback is the large number of labelled data that are selected using this strategy. In the
second strategy, not only is the size of the selected labelled data much smaller, but the
dataset is also highly effective when using the GCC system. This approach is not effective when
the data is scattered all over the input space. Method 3 has the advantage of reducing the
number of selected labelled data to the desired degree. Because of the random selection in this
technique, the selected labelled data may not be effective.
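Method 3 with neighbour blocking can be sketched as follows; the node-to-pattern assignment and the grid are illustrative.

```python
import random

def method3_selection(assignments, n_nodes, seed=0):
    # Sketch of selection Method 3: pick random SOM nodes, take one
    # pattern from each, and block each picked node's neighbours so
    # nothing more is selected nearby.  'assignments' maps each node
    # (i, j) to the patterns clustered to it; all names are illustrative.
    rng = random.Random(seed)
    nodes = [s for s in assignments if assignments[s]]
    rng.shuffle(nodes)
    blocked, chosen = set(), []
    for s in nodes:
        if len(chosen) == n_nodes or s in blocked:
            continue
        chosen.append(assignments[s][0])  # one pattern from this node
        i, j = s
        for di in (-1, 0, 1):             # block the 8-neighbourhood
            for dj in (-1, 0, 1):
                blocked.add((i + di, j + dj))
    return chosen

# A 4x4 map with one pattern per node, named after its node.
assignments = {(i, j): [f"p{i}{j}"] for i in range(4) for j in range(4)}
picked = method3_selection(assignments, n_nodes=3, seed=1)
```

The blocking step guarantees that any two selected nodes are at least two map cells apart, which is what spreads the labelled patterns across the input space.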
3.4 Experiments
GCC is applied to six existing datasets, since there were no established benchmarks for the
use of unlabelled data in classification. To explore different aspects of the GCC system
and its ability to use unlabelled data, these datasets are selected based on their different
statistical properties (distribution, distance relationship of the classes). In order to estimate
the accuracy of the system at each step, the chosen datasets are all labelled and divided
(randomly, with a similar distribution) into three portions: labelled, unlabelled and test
data. For the unlabelled data, the labels are set aside and never used during the process.
The number of labelled data is reduced to make the classification problem more difficult.
Several sets of experiments are performed on GCC. Each experiment has been designed
with objectives which will be addressed later in its corresponding section.
The experiments were as follows:

Experiment #1: explore the applicability of GCC on different datasets

Experiment #2: test the effectiveness of the Conservative and Alliance methods

Experiment #3: test the effectiveness of the second-order labelling and neighbouring
methods

Experiment #4: explore the GCC system performance on selected labelled data and
selection strategies
3.5 An Example
In order to highlight the problem and investigate the effect of the proposed solution and methods
in the previous sections, all the experiments have been executed on six benchmark datasets.
Each of these datasets represents a different statistical data space. This section presents
the results for the IRIS dataset (as an example) to clarify and ease the understanding of
the experiments.
The IRIS dataset (Fisher, 1936) is a well-known benchmark dataset that consists of four
characteristics of iris plants and classifies them into three classes of iris with 50 exemplars in
each class. One class is linearly separable from the other two, which are not linearly separable
from each other. BP networks can be trained to over 90% accuracy with 75 examples in the
training set and 75 examples in the test set. As was previously mentioned, the SOM offers
excellent visualization capabilities and techniques for comparing input data items. To analyze
the performance of the GCC system on the different datasets, the SOM is used to obtain
prior knowledge about the distribution of the data and the statistical relationship of the classes.
This knowledge can be obtained by training the SOM network with the training datasets and
plotting the classes separately. Figure 3.3 is a schematic representation of the IRIS data space.
For this dataset, the classes are clustered into different regions with few overlaps.
Figure 3.3: Iris Dataset Distribution
In Experiments 1, 2 and 3, the datasets are randomly (without using prior knowledge)
divided into labelled, unlabelled, and test datasets. During the experiments, all the labels in
the unlabelled data are discarded and not used during the procedure. However, it is important
to note that in the analysis stage of the GCC, prior knowledge about each node, each data
pattern and its true class is used to deal with them individually. This consideration helps
to collect more information about the types of errors that occur in the system. In addition,
this information is necessary to build a Confusion Matrix to estimate the accuracy and
performance of the system at each stage.
During the GCC procedure, first, the SOM is trained using either the Alliance or the
Conservative method. Then, during the labelling process, nodes are assigned a label. Un-
labelled data patterns that are mapped to labelling nodes are assigned a label and are
added to the labelled data to be used in the second-order relabelling process. Those unla-
belled data patterns that have been assigned to ambiguous nodes remain unclassified
and are passed to the next level for further processing. Tables 3.1 and 3.2 present ex-
amples of the confusion matrices that are used to evaluate the GCC system at each step. These
tables show system accuracy and the ability to produce self-labelled data, comparing the
Alliance and Conservative methods based on first-order labelling without using
the neighbouring procedure. Please note that the sizes of the datasets (labelled, unlabelled,
and test) and the network parameters for the experiments are given in the next chapter.
Table 3.1: Conservative method
Labelling ability = (15/66) = 23%
Labelling accuracy = (15/19) = 79%

Table 3.2: Alliance method
Labelling ability = (21/66) = 32%
Labelling accuracy = (21/23) = 91%
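Under one consistent reading of Tables 3.1 and 3.2 (labelling ability = correctly self-labelled patterns over all unlabelled patterns; labelling accuracy = correctly self-labelled over all self-labelled), the two metrics can be computed from a confusion matrix as sketched below. The per-class counts are illustrative, chosen only so the totals reproduce the Alliance method's reported figures.

```python
def labelling_metrics(confusion):
    # Labelling ability and accuracy from a confusion matrix laid out as
    # {true_class: {assigned_class_or_"undefined": count}}.
    # ability  = correctly self-labelled / all unlabelled patterns
    # accuracy = correctly self-labelled / all self-labelled patterns
    total = sum(sum(row.values()) for row in confusion.values())
    labelled = sum(n for row in confusion.values()
                   for a, n in row.items() if a != "undefined")
    correct = sum(row.get(c, 0) for c, row in confusion.items())
    ability = correct / total
    accuracy = correct / labelled if labelled else 0.0
    return ability, accuracy

# Illustrative per-class counts, chosen only so the totals reproduce the
# Alliance figures: 21 correct of 23 self-labelled, 66 patterns in all.
confusion = {
    1: {1: 8, "undefined": 14},
    2: {2: 7, 3: 1, "undefined": 14},
    3: {2: 1, 3: 6, "undefined": 15},
}
ability, accuracy = labelling_metrics(confusion)
```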
It is clear that some of the new self-labelled patterns have been mis-classified (for exam-
ple, in the Alliance method (Table 3.2), two unlabelled patterns are mis-classified). The data
patterns that belong to the undefined column will be passed to the second-order relabelling
stage. For each stage, the labelling accuracy and ability are calculated separately to fol-
low the performance of the system. The GCC results for the self-labelling procedure
are summarized in Figures 3.4 and 3.5. These figures present the results for Experiment
3; however, they are representative of the overall performance of the GCC system and of the
Alliance and Conservative methods.
Figure 3.4: Self-labelling accuracy

Figure 3.5: Self-labelling ability
A labelling accuracy of 79% in the Conservative method and 91% in the Alliance method demonstrates that the Alliance method reduces the number of mis-classifications (Figure 3.4). The same conclusion is true of the labelling ability of the GCC system (Figure 3.5). Moving from the initial stage, first-order-labelling, through the use of Neighbouring, second-order-labelling and Neighbouring, the system's ability to produce self-labelled data increases (23% to 59% to 87%). In all the steps, the Alliance performance is higher than the Conservative. These improvements are in aspects of both the quantity and the quality of the self-labelled data. These results are discussed further in the next chapter.
After assigning labels to unlabelled data to a desired degree, back-propagation is used to test the GCC approach by evaluating BP performance on the original labelled patterns, as well as on the newly added self-labelled data. The BP performance on the original labelled data is considered to be a baseline against which to make comparisons as the amount of training data is increased by progressively adding proportions of self-labelled data. In this section, BP is trained and tested four separate times on labelled data and proportions (randomly divided) of self-labelled data. The results are summarized in Figure 3.6. It is clear that by adding more self-labelled data, BP shows higher accuracy on the test data. For example, for the Alliance method with 9 labelled data (73%), test performance increases to 78%, 80%, 88% and 91% by adding self-labelled data.
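This evaluation loop can be sketched as follows. It is an illustrative stand-in, not the thesis code: a nearest-centroid classifier replaces the BP network and two-dimensional toy data replaces the IRIS split, but the procedure — train on the labelled baseline, then retrain with growing random proportions of self-labelled data and measure test accuracy each time — is the same.

```python
import random

random.seed(0)

def make_class(cx, cy, n):
    """n noisy 2-D points around centre (cx, cy)."""
    return [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)) for _ in range(n)]

def centroids(points, labels):
    sums = {}
    for (x, y), c in zip(points, labels):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict(cents, p):
    return min(cents, key=lambda c: (p[0] - cents[c][0]) ** 2 + (p[1] - cents[c][1]) ** 2)

def accuracy(cents, pts, labels):
    return sum(predict(cents, p) == y for p, y in zip(pts, labels)) / len(pts)

# Small labelled baseline, larger self-labelled pool, held-out test set.
labelled, lab_y, pool, test, test_y = [], [], [], [], []
for c, (cx, cy) in enumerate([(0, 0), (4, 0), (2, 4)]):
    labelled += make_class(cx, cy, 3); lab_y += [c] * 3
    pool += [(p, c) for p in make_class(cx, cy, 30)]
    test += make_class(cx, cy, 20); test_y += [c] * 20
random.shuffle(pool)

accs = []
for frac in (0.0, 0.25, 0.5, 1.0):        # proportion of self-labelled data added
    k = int(len(pool) * frac)
    cents = centroids(labelled + [p for p, _ in pool[:k]],
                      lab_y + [c for _, c in pool[:k]])
    accs.append(accuracy(cents, test, test_y))
    print(f"+{k:3d} self-labelled: test accuracy {accs[-1]:.2f}")
```

The first round (no self-labelled data) mirrors the 9-sample IRIS baseline; each later round retrains on the baseline plus a larger random slice of the self-labelled pool.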
Figure 3.6: BP performance on test data, IRIS dataset
In Experiment 4, the benchmark datasets are only divided into testing and training sets, and the labels for the training set are discarded. Then, the SOM is trained on the training data and the resulting clusters are used to select the labelled and unlabelled data. As was previously mentioned, selection can be done with different strategies (Methods 1, 2 and 3). When selection is complete, the true labels are assigned to the patterns to produce labelled data. The number of labelled data obtained in Method 1 is large (34) for the IRIS data compared to 9 in the previous experiments. It is important to note that one of the major goals in this experiment is to reduce the cost of producing labelled data by selecting the most effective, and at the same time smallest, amount of labelled data. The same methods as in Experiments 1 and 2 (Alliance, Neighbourhood, and Second-order-Relabelling) are applied to the resulting labelled data. A summary of BP performance on these datasets is given in Figure 3.7. Method 1 is a powerful technique which results in fewer mis-classified patterns. However, comparing the 34 labelled data obtained in Method 1 with 17 in Method 2 and 7 in Method 3, as well as the classification results presented in Figure 3.7 (92%, 92%, 90%), shows that the use of Method 1 for some of the datasets may not be necessary. The difference between the highest performance with 7 labelled patterns and 34 labelled patterns is 2%, which can be ignored when the cost of labelling is five times lower.
Figure 3.7: Testing performance (BP network) on selected IRIS labelled dataset
In conclusion, it can be said that the GCC system was capable of producing sufficient self-labelled data to improve the classification procedure for the IRIS data. The rest of the experiments will be discussed in the next chapter.
3.6 Conclusion
It would have been interesting to find a way to investigate to what degree the learned function is an accurate representation of the original function and, as a result of that, to examine the effectiveness of this approach and the validity of the labels assigned to unlabelled data. In practice, the effectiveness of this technique could not be evaluated based on mathematical concepts. Basically, the performance of the system is highly dependent on the clustering accuracy of the SOM for each particular dataset. Depending on the statistical relationship of the classes in each dataset, they might be clustered into separate parts of the SOM output space or with a high degree of overlap. As a result, there is a risk of incorrect label inference (mis-classification) for the unlabelled data. The probability of mis-classification decreases with the increasing number of labelled vectors clustered to each node (according to the law of large numbers).
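This law-of-large-numbers effect can be illustrated with a small simulation (purely illustrative, not from the thesis): if each labelled vector falling on a node carries the node's true class with probability p, the chance that a majority vote over n such vectors assigns the wrong label shrinks quickly as n grows.

```python
import random

random.seed(1)

def mislabel_rate(n, p=0.7, trials=4000):
    """Monte Carlo estimate of P(majority vote over n labelled vectors is wrong)."""
    wrong = 0
    for _ in range(trials):
        votes = sum(random.random() < p for _ in range(n))  # votes for the true class
        if 2 * votes <= n:                                  # minority or tie: wrong label
            wrong += 1
    return wrong / trials

for n in (1, 5, 15, 45):
    print(f"n={n:2d}: estimated mislabel probability {mislabel_rate(n):.3f}")
```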
The goal of this research as well as its problems were discussed in previous sections. In addition, a novel algorithm (GCC) and its theoretical and practical aspects were explored along with an example. The following chapter contains detailed information on the experiments, such as the size of the datasets, the network parameters and the rest of the results. In Chapter 5, advantages and disadvantages of this study and proposed solutions, mis-classification and system performance on different datasets will be analyzed.
Chapter 4
Experiments
4.1 Introduction
In the previous chapter, the theoretical aspects of the problem, proposed solutions and their objectives were examined. To validate these solutions, a number of experiments were executed on several benchmark datasets. Each experiment was designed with particular objectives. Some assumptions have been made during each simulation, and each dataset was chosen for a specific purpose. All these issues will be discussed in detail in this chapter. The empirical results for the IRIS dataset were presented in Chapter 3 to provide a general understanding of the simulations. The rest of the results and their analysis will be presented in this and the following chapters.
4.2 Data Collection and Description
To investigate the performance of the Labelling, Neighbouring and Re-labelling procedures, each dataset was selected in such a way that each one represents a different statistical characteristic. The resulting information can later be used to predict the performance of the GCC system on other datasets.
Data Description
Escherichia coli is one of the members of the coliform bacteria group normally found in human and animal intestines, and is indicative of fecal contamination when found in water. Determination of E. coli presence is often used to measure the microbiological safety of drinking water supplies. A dataset collected by researchers at Agriculture Canada and the University of Guelph consists of 228 samples of 13 inputs each, where the inputs are the results of such tests as "time to detection after 3 hours exposure to acid". In previous work (Stacey, 1998), an 85% accuracy was achieved for a BP network trained with 50% of the dataset and tested on the remaining data. This dataset serves as the basis for the second set of experiments.
The next dataset is Dr. William W. Wolberg's Wisconsin Breast Cancer data. This dataset contains 699 samples with 458 (65.5%) samples in the class Benign and 241 (34.5%) in the class Malignant. Each sample has 9 input features. Previous classification work by Zhang (Zhang, 1990) achieved 93.7% accuracy using only 200 instances for training a 1-nearest neighbour algorithm.
The Heart Disease dataset consists of 920 measurements of heart problem indicators collected from the Cleveland Clinic Foundation. Each measurement consists of 14 features. Other researchers (Detrano et al., 1989) have reported a 77% classification accuracy with a logistic-regression-derived discriminant function on this dataset when training on two thirds of the data and testing on the remaining third. There are two categories: no heart-disease present (509 exemplars = 55%) and heart-disease present (411 exemplars = 45%).
The Mushroom dataset consists of information about well-known mushrooms from the Audubon Field Guide provided by Schlimmer (Iba et al., 1988). This dataset contains 22 attributes per exemplar, with both quantitative measures as well as qualitative ones. There are two categories of mushrooms considered: definitely edible, and definitely poisonous or unknown. There are 8124 exemplars: 4208 (51.8%) edible, and 3916 (48.2%) inedible. Iba achieved approximately 95% classification accuracy on this set using 1000 instances for training their HILLARY algorithm.
The final dataset is the National Institute of Diabetes and Digestive and Kidney Diseases Database. It consists of 768 exemplars in two classes: "patient tested negative for diabetes" (500 exemplars, 65%), and "patient tested positive" (268 exemplars, 35%). There are 8 attributes used for prediction. Using 576 training examples, Smith et al. (Smith et al., 1988) achieved 76% accuracy, using their ADAP algorithm, on the remaining 192 instances.
Datasets Order Display
The SOM is an unsupervised tool for automatically arranging high-dimensional statistical datasets so that similar inputs are mapped close to each other. This map can be helpful in exploring a dataset by visualizing the data space, providing the desired information, and revealing surprising distance relations between different items of the datasets. This information may be used to predict:

How the Labelling, Neighbouring and Re-labelling procedures work.

The range of improvement in the classification procedure using the GCC approach.

Which strategy (discussed in the previous chapter) is appropriate for selecting the labelled data.

To ease the discussion and obtain a general view of the information contained in the datasets, the resulting maps of the trained SOM (on training data, both labelled and unlabelled) are presented here. Each map represents the data space and distances between classes.
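A stripped-down self-organizing map in pure Python shows the mechanism behind these maps (an illustrative sketch only — the function names, Gaussian neighbourhood and linearly decaying learning rate are this example's assumptions, not the thesis software's parameters): similar inputs win nearby grid nodes, so well-separated classes occupy different regions of the map.

```python
import math
import random

random.seed(2)

def best_node(w, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    return min(((i, j) for i in range(len(w)) for j in range(len(w[0]))),
               key=lambda ij: sum((a - b) ** 2 for a, b in zip(w[ij[0]][ij[1]], x)))

def train_som(data, rows, cols, epochs=30, lr0=0.5):
    dim = len(data[0])
    r0 = max(rows, cols) / 2
    w = [[[random.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)        # decaying learning rate
        rad = 1 + r0 * (1 - e / epochs)    # shrinking neighbourhood radius
        for x in data:
            bi, bj = best_node(w, x)
            for i in range(rows):
                for j in range(cols):
                    h = math.exp(-((i - bi) ** 2 + (j - bj) ** 2) / (2 * rad * rad))
                    for k in range(dim):
                        w[i][j][k] += lr * h * (x[k] - w[i][j][k])
    return w

# Two well-separated toy clusters: after training they map to distinct regions.
cluster1 = [[random.gauss(0.2, 0.05), random.gauss(0.2, 0.05)] for _ in range(40)]
cluster2 = [[random.gauss(0.8, 0.05), random.gauss(0.8, 0.05)] for _ in range(40)]
som = train_som(cluster1 + cluster2, 6, 6)
nodes1 = {best_node(som, x) for x in cluster1}
nodes2 = {best_node(som, x) for x in cluster2}
print("overlapping nodes:", len(nodes1 & nodes2))
```

Plotting the winning nodes of each class, as in the figures below, makes the separability (or overlap) of the classes visible at a glance.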
Figure 4.1: E. coli data
Figure 4.2: Breast Cancer data
Figure 4.3: Mushroom data
Figure 4.4: Diabetes data
Figure 4.5: Heart Disease data
Differences between the datasets can be recognized by comparing Figures 4.1 to 4.5. For example, in Figures 4.2 and 4.3 (Breast Cancer and Mushroom), the classes are clustered into separate regions of the data space with the exception of a few overlaps. On the other hand, in the E. coli, Diabetes and Heart Disease cases (Figures 4.1, 4.4 and 4.5), the classes are hardly separable and have many overlaps. The use of this information will be pointed out in the appropriate sections that follow.
4.3 Experiments
Four sets of simulations were executed, each one with a different purpose. The objective of the first three experiments was to examine the GCC (Guelph Cluster Class) approach, its variations and their performance on different datasets. The last experiment focuses on labelled data selection and different strategies for selecting the most effective labelled data.
The selected datasets were all labelled. To run the experiments, they were divided into training and testing data. Training datasets were later partitioned into labelled and unlabelled data. All the selections were random, with similar distributions. For unlabelled data, the labels were set aside and never used in the process. Besides some parameters that were varied during each experiment, there were also parameters with fixed values that were chosen based on experience in the first experiment. For example, the labelled datasets selected for Experiments 2 and 3 were selected based on preliminary results obtained in Experiment 1. The major effort during the random selections in each experiment was to reduce the number of labelled data in such a way as to make the classification problem more difficult.

Evidence from Experiment 1 was useful in the selection of SOM and BP parameters during Experiments 2 and 3. SOM and BP parameter values were also set based on experiments in the literature. Since BP is a very popular neural network, guideline references for its parameters can be easily obtained by the reader. Detailed guidelines for how to actually choose SOM parameters are given in several publications by Kohonen and Kaski (for example (Kohonen, 1997)). They contain information on computing the proper maps. Topology preservation of the input space is quite difficult to define. Two different approaches for measuring the degree of topology preservation (by the SOM) are reviewed in (Kaski and Lagus, 1996). Prior knowledge about the data can be used in choosing the SOM features (Joutsiniemi et al., 1995). Scaling of the data is very important before applying the SOM algorithm ((Kohonen et al., 1996) and (Kaski and Kohonen, 1996)).
4.3.1 Experiment #1
This section covers detailed information about the size of the datasets, the SOM and BP parameters and the selection of labelled data, for several preliminary simulations that were carried out on the benchmark datasets. The main objectives of this experiment were i) to randomly select and reduce the number of labelled data to make the classification problem more difficult, and ii) to examine the overall performance of GCC on different datasets.

For each dataset, the data were randomly partitioned into training data and test data. Then, the training data were again randomly subdivided into labelled and unlabelled data. For the unlabelled data, all labels were discarded. The sizes of the labelled, unlabelled and test datasets are given in Table 4.1.
Table 4.1: Number of labelled, unlabelled, and test data

Data Sets       Labelled   Unlabelled   Test   Total
IRIS                   9           66     75     150
E. coli               32          106     90     228
Breast Cancer         50          389    260     699
Mushroom              90            -      -    8124
Diabetes              65          447    256     768
Heart Disease         65          547    308     920

Table 4.2: Size of classes in labelled and unlabelled data
All data for the labelled and unlabelled sets were selected randomly; however, a balanced number of samples from each class was maintained (Table 4.2). A BP network can perform very accurately even with a small number of training data if the data contains sufficient information. Consequently, a large number of labelled data were selected at the beginning of this experiment and they were gradually reduced to make the procedure harder. This reduction was continued to the point where the BP network was not able to perform very well with only the labelled data. It is important to note that since all the selections were random, this procedure was very tricky. The distribution, intraclass and interclass distances, and other statistical properties of the selected patterns were unpredictable, as was the GCC system performance. The resulting labelled datasets were used in Experiments 1, 2 and 3.
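The partitioning described above can be sketched as follows (a hypothetical helper, not the thesis code): each class is split in the same proportions, so the labelled, unlabelled and test sets all keep the balanced class composition the experiments rely on.

```python
import random

random.seed(3)

def stratified_split(samples, labels, fractions):
    """Split (samples, labels) into len(fractions) parts, class-balanced."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    parts = [([], []) for _ in fractions]
    for y, items in by_class.items():
        random.shuffle(items)
        start = 0
        for (part_x, part_y), frac in zip(parts, fractions):
            stop = start + int(round(frac * len(items)))
            part_x.extend(items[start:stop])
            part_y.extend([y] * (stop - start))
            start = stop
    return parts

# 150 samples, 50 per class, like IRIS; ~6% labelled, 44% unlabelled, 50% test
data, labels = list(range(150)), [i % 3 for i in range(150)]
(lab, lab_y), (unlab, _), (test, test_y) = stratified_split(data, labels,
                                                           [0.06, 0.44, 0.50])
print(len(lab), len(unlab), len(test))  # → 9 66 75
```

The labels of the middle partition are simply discarded to obtain the unlabelled pool.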
Selection of network parameters was based on running different simulations and on the existing literature. For each network, BP and SOM, parameters were varied for different datasets. However, they were fixed during the experiments on a specific dataset. These fixed parameters were chosen based on the optimal result from each network (Table 4.3). During the simulations, training and testing results were based on an average of four trials. Since no great variation in performance on particular datasets was observed, only four trials were conducted and efforts were directed to examining a variety of different datasets.
Similar to the example given in the previous chapter, a Confusion Matrix was used to calculate the accuracy of the procedure at each step. To examine whether the GCC procedure can improve classification, the resulting self-labelled data were randomly divided into different proportions (without using prior knowledge such as distributions, classes, etc.). Then, the training data were increased by adding these sets of self-labelled data. Since the baseline training data for BP is very small (the original labelled data), its training time was set to a small value (Table 4.3) in order to avoid over-training as the training data was gradually increasing.
Table 4.3: Network parameters

BP network
Datasets        L.R.    Input Num   Output Num   Hidden Num   Epoch
IRIS            0.1           4          3            3        400
E. coli         0.05         13          2            5        400
Breast Cancer   0.1           9          1            3        200
Mushroom        0.1         125          1            3        200

SOM: L.R. 0.1, dimension 8x4, radius 5, 10000 epochs
Results of this experiment demonstrate that the GCC procedure can successfully improve classification for datasets with different statistical properties. The E. coli dataset was one of the most challenging datasets used for experimentation. (Figure 4.1 demonstrates that the interclass distances between its items are small and the classes hardly separable.) 32 labelled samples from this dataset were selected as training data. As can be seen in Table 4.4, GCC was only able to achieve a maximum 63% ability to self-label, with an accuracy of 77% (20 mis-classified data). Thus it might be assumed that GCC was not going to be able to improve classification. However, on examination (see Figure 4.6), it can be seen that the BP network trained with the labelled data (32 samples) and the self-labelled data (a total of 118, with 20 mis-classified data) achieved a maximum classification accuracy of 73% on testing. The BP performance on the original labelled data was 57%.

Another example of a different dataset was the Mushroom dataset. 90 labelled samples were used for training. Because of the distribution of the two classes in this dataset (Figure 4.3), labelling ability and accuracy are very high, 96% and 99% (Table 4.5). The 99% labelling accuracy represents GCC's ability to obtain correct self-labelled data when using this kind of dataset. By gradually increasing the self-labelled data in training, the accuracy of the trained BP network jumped from 53% to 95%. These results are shown in Figure 4.7.
Table 4.4: Confusion matrix for unlabelled data (E. coli)

Labelling ability = (66/104) x 100 = 63%
Labelling accuracy = (66/86) x 100 = 77%

Class     1     2     undefined     Total
1        36     8         8          52
2        12    30        10          52

Table 4.5: Confusion matrix for unlabelled data (Mushroom)

Labelling ability = (5112/5303) x 100 = 96%
Labelling accuracy = (5112/5143) x 100 = 99%
GCC was capable of improving the classification procedure for all the datasets. Depending on the dataset, the range of improvement was different. For example, in Figures 4.6 and 4.7, it can be seen that the range of improvement for the Mushroom dataset was significantly larger than for the E. coli dataset. The same conclusion can be made about the range of improvement in the other datasets. Results for the rest of the datasets are presented in Appendix A.
Figure 4.6: GCC ability to improve classification of the E. coli dataset

Figure 4.7: GCC ability to improve classification of the Mushroom dataset
4.3.2 Experiment #2
It can now be said that the GCC approach is capable of using unlabelled data to improve supervised learning. However, this improvement is highly dependent on the quantity and the accuracy of the self-labelled data. The main objective of this experiment was to improve the quality and quantity of the self-labelled data, which hopefully would lead to an improvement in overall classification accuracy. This objective was tested by training the SOM using two different datasets: (i) a combination of labelled and unlabelled data (Alliance), (ii) only labelled data (Conservative). The sizes of the training sets and their classes are given in Tables 4.1 and 4.2 in the previous section. The same strategies used in Experiment 1 were used in this experiment. The Conservative technique's parameters were the same as those given in Table 4.3. For the Alliance technique, the network parameters are summarized in Table 4.6.

By using the Alliance method, it was expected that there would be an improvement in the accuracy of labelling, using information on both labelled and unlabelled data. Table 4.7 summarizes the results of the Alliance and Conservative methods for both the SOM and BP networks. As was expected (see the first two columns of Table 4.7), the SOM was able to increase its labelling accuracy using the Alliance procedure. For example, in the E. coli and Diabetes datasets the ranges of improvement were 8% and 9%. These improvements were in both the quantity and the accuracy of the produced self-labelled data. Please note that no great variations in the performance of each network were seen.
Table 4.6: Network parameters for the Alliance method

BP network
Datasets        L.R.    Input Num   Output Num   Hidden Num   Epoch
IRIS            0.2           4          3             3        400
E. coli         0.005        13          1             7       1000
Breast Cancer   0.2           9          1             8        150
Mushroom        0.2         125          1            35        150
Diabetes        0.01          8          1             3       1000
Heart Disease   0.01         13          1             4       1000

SOM
Datasets        L.R.    Dimension   Radius   Epoch
IRIS            0.1        8x4         5       4000
E. coli         0.1       20x15       18      10000
Breast Cancer   0.1       18x15       15       4000
Mushroom        0.1       30x25       30       1000
Diabetes        0.1       40x35       35      10000
Heart Disease   0.1       40x35       35       9000
Table 4.7: Alliance and Conservative results

                 SOM (labelling accuracy)      BP (test data)
Data Sets        Alliance    Conservative      Alliance    Conservative
IRIS               91%           87%             91%           89%
E. coli            60%           52%             73%           73%
Breast Cancer      97%           89%             97%           96%
Mushroom           96%           92%             98%           93%
Diabetes           57%           48%             75%           75%
Heart Disease      75%           68%             79%           80%

Despite this improvement in the labelling procedure, supervised learning accuracy did not increase (the third and fourth columns of Table 4.7 show BP performance on the test datasets). Although the self-labelling accuracy increased for all the datasets, the BP network's accuracy remained unchanged, except for the Mushroom dataset (Conservative 93% and Alliance 98%). The reason for this exception was the 233 correct patterns that were added as self-labelled data in the Alliance procedure. For the rest of the datasets, the improvement from the self-labelled data was not significant.

4.3.3 Experiment #3

As was previously discussed, the SOM has a tendency to assign similar data patterns to neighbouring nodes. Since the partitioning of the SOM nodes into labelling and non-labelling nodes results in a small proportion of the nodes being used for the labelling procedure, neighbours of the original nodes are considered as labelling nodes if sufficient evidence is available (the Neighbouring procedure). An alternative technique for increasing the quantity of the self-labelled data is re-labelling. This experiment was designed with the objective of testing the effect of the Neighbouring and re-labelling techniques on the GCC system. In this technique, the SOM is retrained using the original labelled data as well as the new self-labelled data. Please note that re-labelling is also referred to as second-order-labelling. In both methods, the probability of having mis-classified data will increase. In
the Neighbouring procedure, neighbours of the labelling node may contain data from a different class. Another problem is that mis-classified data resulting from first-order-labelling may mislead the re-labelling procedure. These problems, along with the effect of the techniques, were investigated by conducting the appropriate experiments on all the benchmark datasets. The sizes of the labelled and unlabelled datasets and the class composition in this experiment are given in Tables 4.1 and 4.2. The SOM and BP were trained with the same parameters as shown in Tables 4.3 and 4.6.
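A sketch of the Neighbouring step makes the risk concrete (the function name, vote threshold and agreement rule here are this sketch's assumptions, not the thesis implementation): a node with no labelled vectors of its own inherits a label only when its grid neighbours supply enough unanimous evidence, and any disagreement among neighbours leaves the node unlabelled.

```python
def neighbouring(node_labels, rows, cols, min_votes=2):
    """Extend {(i, j): class} over unlabelled grid nodes with enough evidence."""
    extended = dict(node_labels)
    for i in range(rows):
        for j in range(cols):
            if (i, j) in node_labels:
                continue
            votes = {}
            for di in (-1, 0, 1):               # 8-connected grid neighbourhood
                for dj in (-1, 0, 1):
                    lbl = node_labels.get((i + di, j + dj))
                    if lbl is not None:
                        votes[lbl] = votes.get(lbl, 0) + 1
            # sufficient evidence = enough votes and no competing class
            if votes and len(votes) == 1:
                (lbl, n), = votes.items()
                if n >= min_votes:
                    extended[(i, j)] = lbl
    return extended

labelling_nodes = {(0, 0): 'A', (0, 1): 'A', (3, 3): 'B', (3, 4): 'B'}
extended = neighbouring(labelling_nodes, 5, 5)
print(len(extended) - len(labelling_nodes), "nodes gained labels")
```

A mixed neighbourhood (votes for more than one class) leaves the node unlabelled, which is exactly where mis-classified first-order labels can do damage once re-labelling reuses them as evidence.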
In the initial stages of this experiment, the self-labelling process was performed without using the Neighbouring or Re-labelling techniques. These results were used as a baseline to evaluate the effect of these techniques on the self-labelling process. Comparisons were done for both the Conservative and Alliance methods. The training datasets used for the classification stage (BP) were the ones with the highest accuracy and largest number of self-labelled data obtained from these comparisons. For example, if the highest accuracy of self-labelled data belonged to the experiment using the Neighbouring and re-labelling techniques, then the resulting self-labelled data (from this simulation) were used to train and test the BP network. To summarize the results of this experiment, the Diabetes and Breast Cancer datasets have been selected as representatives of datasets with different statistical properties. SOM performance on the self-labelling procedure is presented in Figures 4.8 and 4.10. Figures 4.9 and 4.11 summarize the BP performance on the test data.
Figure 4.8: Labelling ability through different techniques on the Breast Cancer dataset
Figure 4.9: GCC ability to improve classification of the Breast Cancer dataset
Figure 4.10: Labelling ability through different techniques on the Diabetes dataset
Figure 4.11: GCC ability to improve classification of the Diabetes dataset
For the Breast Cancer dataset, 50 labelled samples were used for training. As the self-labelling process progressed from the initial stage, through the use of neighbouring alone, to the use of both the neighbouring and re-labelling methods (Figure 4.8), the GCC system was able to gradually increase the quantity of the self-labelled data (from 38% to 62% to 91% for the Conservative method and from 47% to 75% to 97% for the Alliance method). This self-labelling was extremely accurate (high 90s). When the BP network was trained with the labelled and self-labelled data, the accuracy of the trained network was comparable to systems trained with more data (Zhang used 200 samples compared to 50 samples here). These results are summarized in Figure 4.9.
The Diabetes dataset represents one of the most challenging problems for the GCC system. 65 labelled data were used for training. GCC was only able to achieve 60% ability in the self-labelling process (Figure 4.10). The BP network trained with only 65 samples and self-labelled data achieved a maximum classification accuracy of 75% on the test dataset (Figure 4.11). Since the selection of self-labelled data for the different training sets was random, the number of mis-classified data in each BP network's training set varied and was not controlled. A sudden drop in the BP network performance (from 70% to 67%) in Figure 4.11 was caused by the mis-classified patterns that had been selected for the training set. In conclusion, it can be said that even with some mis-classified self-labelled data, GCC is able to produce sufficient samples of properly classified self-labelled data to increase classification accuracy for different datasets.
4.3.4 Experiment #4
Most studies use both labelled and unlabelled data to improve supervised learning. The use of labelled data can be costly, since data samples have to be labelled manually. Thus, it is very important to select the most informative samples to be labelled. In the previous experiments, labelled data were randomly selected from an array of data patterns (all labelled) without using prior knowledge about the data or data space (there are problems encountered with this selection method, which will be discussed in the following chapter). Instead of randomly selecting the patterns, a different paradigm for data selection has been introduced: active learning. In active learning, the samples are selected in such a way that they are expected to be the most informative ones. The goals of active learning are i) to improve the efficiency of the patterns, with the aim of using fewer patterns and achieving higher performance, ii) to improve the cost efficiency of data acquisition by labelling only those data that are expected to be informative, and iii) to facilitate training by removing redundancy from the training set. In the following section a novel approach is discussed which encompasses the selection of labelled samples by using the information from the clustering stage to facilitate and increase the accuracy of the labelling procedure. This experiment was designed with the objective of avoiding the problems in Experiment 1, improving the quality and reducing the quantity of the labelled data.
In this approach, the training sets (Table 4.1) were used to train a SOM (all the labels of these training sets were discarded). Then, samples were selected based on labelling nodes (nodes to which at least one input vector has been assigned) by one of the following strategies.

Selecting one sample (randomly) from

all the labelling nodes, (Method 1)

nodes selected based on the density of the data patterns in the data space (giving a higher probability of selection to nodes in low-density neighbourhoods than to nodes in high-density regions), (Method 2)

randomly selected nodes (when a node was selected, its whole neighbourhood was blocked), (Method 3)
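Two of these strategies can be sketched around a single node-to-samples index (the names are this sketch's own, and Method 3's blocking rule is a simplified reading of the description above): Method 1 draws one pattern from every occupied node, while Method 3 visits nodes in random order and blocks the 8-connected neighbourhood of each selected node, shrinking the labelled set further.

```python
import random

random.seed(4)

def method1(node_of):
    """One randomly chosen sample from every occupied (labelling) node."""
    per_node = {}
    for sample, node in node_of.items():
        per_node.setdefault(node, []).append(sample)
    return sorted(random.choice(members) for members in per_node.values())

def method3(node_of):
    """One sample per randomly visited node; block its 8 grid neighbours."""
    per_node = {}
    for sample, node in node_of.items():
        per_node.setdefault(node, []).append(sample)
    nodes = list(per_node)
    random.shuffle(nodes)
    chosen, blocked = [], set()
    for (i, j) in nodes:
        if (i, j) in blocked:
            continue
        chosen.append(random.choice(per_node[(i, j)]))
        blocked.update((i + di, j + dj)
                       for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return sorted(chosen)

# sample id -> winning SOM node from the clustering stage (toy assignment)
node_of = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (2, 2), 4: (2, 2), 5: (4, 4)}
print("Method 1:", method1(node_of))   # one sample per occupied node
print("Method 3:", method3(node_of))   # fewer samples: neighbours are blocked
```

Method 2 would replace the uniform shuffle with density-weighted sampling over the nodes; only the chosen samples are then sent out for manual labelling.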
To obtain the initial labelled data, the selected unlabelled samples (from one of these strategies) had to be assigned their correct labels. The remaining unlabelled patterns were considered to be unlabelled data. To compare these strategies, the GCC system was examined on the resulting labelled datasets. Similarly to the previous experiment, the Alliance, Neighbouring and Re-labelling procedures were applied to obtain self-labelled data. Then, a BP network was trained and tested using the initial labelled data as well as the new self-labelled data. The sizes of the labelled and unlabelled datasets and the labelled sample classes are given in Table 4.8.
It is important to note that the objective of this experiment was to reduce the size of the labelled dataset without reducing the GCC system performance. When progressing through the selection strategies, from Method 1 to Method 3, the number of selected labelled data samples decreased at each step (Table 4.8). For example, in the Breast Cancer dataset, the labelled data size changed from 110 in Method 1 to 33 and 22 in Methods 2 and 3. Even though the labelled data obtained in Method 3 were five-fold smaller than the labelled data in Method 1, the GCC system was capable of achieving the same accuracy on the classification of the test data (a high of 95%) as in Method 1. The range of improvement from the lowest accuracy to the highest in Method 3 is much larger (64%) than in Method 1 (30%). Table 4.9 summarizes the results. In that table, for each Method, Column 1 gives the number of labelled data used in that method, and Columns 2 and 3 show the lowest (just the original labelled data) and the highest (labelled and self-labelled data) BP network performance on the test dataset. In the Diabetes dataset, the labelled data obtained in Method 3 is six-fold smaller than in Method 1, but at the same time, its accuracy (highest performance column) is only 7% lower. Figures 4.12 and 4.13 show the results for the Heart Disease and Mushroom datasets, which are very similar to the results discussed for the other datasets. In conclusion, selecting labelled data based on Method 1 may achieve higher accuracy; however, the cost of labelling has to be considered. On the other hand, Method 3 results in a much smaller dataset and reasonable accuracy with the GCC system.
l'able 4.8: Size of Labelled and Unlabelled data
1 Strotc~ies II Mcthod 1 1 Method 2 1 Mcthad 3 . 1 Data Sets
Q, B 1 1RlS
Labelled Class 1 1 Class 2
1 1 3 E. Coli Bread Cancer Mushroom
Unlabelled
3 1 68
Labellcd
25 57 4 8
Unlabelled
1 4 1 1
Class 3
13
Clzoa 1
7 3 5 53 52
Class 2
14
Labcllcd Ciam 1
3 78
329 6290
Clas 2 ( Chia 3
8 1 6 10 12 3 3
15 2 1 3 1
Table 4.9: Labelled data selection results. For each selection strategy (Methods 1 to 3) and each dataset (IRIS, E. Coli, Breast Cancer, Mushroom), the table lists the number of labelled data used, the lowest BP test accuracy (original labelled data only), and the highest BP test accuracy (labelled plus self-labelled data).
Figure 4.12: Testing performance (BP network) on selected Heart Disease labelled data

Figure 4.13: Testing performance (BP network) on selected Mushroom labelled data
Conclusion
This research has shown how a self-organizing map can be used to produce self-labelled
data in sufficient quantities and with sufficient accuracy to enhance the training of a BP
network.
In preliminary experiments, labelled and unlabelled datasets were randomly selected
based on the assumption that in real world applications, the practitioner has no control
over the labelled dataset. On the other hand, it was shown that it is possible to use a SOM
to seek out certain exemplars for labelling. If the labelling process is a costly one, then
information from the clustering stage can be used to assign labels to unlabelled samples in
the labelling process. In addition, to validate the approach, different datasets were tested.
Each dataset represented a different data characteristic.
In conclusion, it can be said that the proposed approach appears to have significant
merit for integrating unlabelled data into the domain of supervised learning. Specifically,
datasets that have reasonably separable classes show significant improvement from adding
unlabelled data.
Chapter 5
Analysis and Discussion
5.1 Introduction
Given the results from the previous chapter, what can be concluded about the behavior of
the GCC system? Clearly, the results show that GCC is capable of providing self-labelled data
with sufficient quantity and quality to improve supervised learning. However, there
are still many questions with respect to GCC performance. Why does the GCC system do
well? When does the GCC system do well? Could its behavior be generalized for different
datasets? If it can be generalized, what characteristics should a dataset have?
It is not an easy task to explain and generalize GCC system performance. Detailed
research, theoretical and empirical, is required to answer some of these questions.
This piece of work is basically a stepping stone (an introduction) to this field and
it does not concentrate on answering all the above questions, because of time limitations.
However, in this chapter, some of these questions are investigated through the performance
of the GCC system on the IRIS and E. coli datasets. The IRIS set is selected as a toy
problem since it is a popular and small benchmark dataset and easy to work with (Figure
5.1 shows the distribution of the IRIS data classes).
The key element in the GCC system that helps to improve supervised learning is the
use of unlabelled data in the procedure. In general, unlabelled patterns contain information
about the problem (Nigam et al., 1998). They provide information about the joint probability
distribution over the items in the data space. In other words, unlabelled data show the
true distribution of the patterns. For example, using only a small set of the IRIS dataset as
labelled data to train the SOM network might result in a map such as the one in Figure 5.2.
With the knowledge of three existing classes in this dataset, three major clusters can be
recognized on the map. Since there exist only three classes, the small cluster (represented
by "??") should belong to one of the major clusters. If the unknown cluster ("??") is
assigned based on its distance from the other clusters, then it belongs to Cluster
#3 (on the right hand side of the map). However, when unlabelled patterns are used in the
procedure (Figure 5.3), this estimate is contradicted. The distribution of the unlabelled data in
the data space (Figure 5.3) shows that there is a higher probability for the unknown cluster
to belong to Cluster #1 or #2. This simple example shows how unlabelled patterns can
help to avoid wrong estimates in clustering and, as a result, in the classification
procedure.
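The effect described above can be sketched numerically. In this hedged illustration (the data, the centroid estimator and the single refinement step are all hypothetical stand-ins, not the GCC algorithm itself), a handful of labelled points gives misleading class centres, and a pool of unlabelled points drawn from the true distribution pulls the estimates back toward the real cluster centres, changing how a borderline query is assigned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D example: two classes, but the few labelled points
# happen to sit at the edges of their true clusters.
labelled_x = np.array([[0.0, 0.0], [0.5, 0.0],   # class 0 (true centre near [1, 0])
                       [4.0, 0.0], [4.5, 0.0]])  # class 1 (true centre near [3, 0])
labelled_y = np.array([0, 0, 1, 1])

def centroids(x, y):
    return np.array([x[y == c].mean(axis=0) for c in (0, 1)])

def nearest(c, q):
    """Index of the centroid closest to query point q."""
    return int(np.argmin(np.linalg.norm(c - q, axis=1)))

query = np.array([2.1, 0.0])

# Using labelled data only, the query looks closer to class 0's centroid.
c_lab = centroids(labelled_x, labelled_y)
print(nearest(c_lab, query))  # class 0

# Unlabelled data drawn from the true distribution pulls the centroid
# estimates toward the real cluster centres (one k-means-style step).
unlab = np.vstack([rng.normal([1, 0], 0.3, (50, 2)),
                   rng.normal([3, 0], 0.3, (50, 2))])
pseudo = np.array([nearest(c_lab, u) for u in unlab])
all_x = np.vstack([labelled_x, unlab])
all_y = np.concatenate([labelled_y, pseudo])
c_all = centroids(all_x, all_y)
print(nearest(c_all, query))  # now class 1
```

The refined centroids move to roughly [1, 0] and [3, 0], so the decision boundary shifts and the same query point changes class.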
Figure 5.1: IRIS data ordered map (training data)
Figure 5.2: A sample labelled data distribution

Figure 5.3: A sample labelled and unlabelled data distribution
In addition to the distribution, other statistical characteristics of the dataset, such as intraclass
and interclass distances, play an important role in GCC system performance. On a
dataset where the intraclass distances are small and, at the same time, the interclass distances
are large, the performance of the GCC system and its variations improves significantly. Based on
the results in the previous chapter, selection on the labelled data and the Neighbouring and
Re-labelling techniques would be more efficient (a smaller number of labelled data and fewer
mis-classified patterns) on datasets with this type of distance relationship between their
classes.
In research by Castelli and his colleagues (Castelli and Cover, 1995), they show that
labelled samples are exponentially valuable in reducing the risk and error in the pattern
recognition field, whereas unlabelled data can only be polynomially valuable. They
prove that infinite unlabelled data, alone, can only be used to estimate the individual component
distributions; it cannot be used to construct a classification rule. With the use of a mixture
distribution function and an error model, they prove that the probability of error (in a
classification procedure) with no labelled data and infinite unlabelled data is 1/2. However,
when the number of labelled data points increases, the error reduces exponentially, and when
there is an infinite amount of labelled data, almost all the components of the mixture
distribution function can be recovered. The sizes of the labelled and unlabelled datasets should
be considered when using unlabelled data in a supervised learning procedure. In a problem
with infinite labelled data, unlabelled data does not aid the reduction of classification error.
If there is already a sufficient amount of labelled data, all the parameters can be retrieved
from just the labelled data and the resulting classifier is Bayes-optimal. The effect of labelled
and unlabelled dataset size in supervised learning has been discussed in detail in (Nigam
et al., 1998). The discussion on the performance of the GCC system and its variations will
continue in the following sections. Empirical results on the effect of labelled and unlabelled
dataset size and distribution are presented by an example.
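Castelli and Cover's point can be illustrated with a toy sketch (all numbers and the crude 2-means estimator below are hypothetical, chosen only to mirror the argument): unlabelled data alone can locate the mixture components, but mapping components to class labels requires at least a few labelled samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D mixture: two components whose means are unknown.
unlabelled = np.concatenate([rng.normal(-2, 0.5, 200),
                             rng.normal(2, 0.5, 200)])

# Unlabelled data alone can recover the component locations
# (here via a crude 2-means split)...
m = np.array([unlabelled.min(), unlabelled.max()])
for _ in range(10):
    assign = np.abs(unlabelled[:, None] - m).argmin(axis=1)
    m = np.array([unlabelled[assign == k].mean() for k in (0, 1)])
print(np.round(m, 1))  # component means recovered, close to [-2, 2]

# ...but it cannot say which component is which class: that mapping
# needs at least a few labelled samples.
labelled = {0: -1.8, 1: 2.1}  # one labelled point per class (assumed)
comp_to_class = {int(np.abs(m - x).argmin()): c for c, x in labelled.items()}

def classify(x):
    return comp_to_class[int(np.abs(m - x).argmin())]

print(classify(-2.2), classify(1.7))
```

Without the two labelled points, the component-to-class dictionary could not be built, which is exactly the sense in which unlabelled data alone cannot construct a classification rule.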
5.2 Analysis of Experiment #1
In general, using a small set of labelled training data for ciassification, accuracy will suffer
because of hi& variance in the parameter estimation procedure. aowever, it is important
t O note t hat given appropriate training data, BP is capable of approximating any functions
to satisfactory accuracy. As it was previously mentioned, one of the objectives for this
experiment was to select and d u c e the size of labelled data in such a way to make the
classification procedure harder. The labelled data was randomly selected fiom the training
data. The rem;rining data was considered as unlabelled data. Random selection of labelled
data resulted in different and unpredictable performances for self-labelling process and the
GCC system. This rinpredictability could be caused by: infinite labelled data, the distance
relation between labelled data items, and the labelled data distribution (if it is a reliable
representative of data space or not).
The size of the selected labelled data and its distribution were major problems during this
experiment. A very small data size did not necessarily make the classification procedure
harder. Sometimes, with a small set of labelled data, BP was capable of producing an
accurate classifier. On the other hand, a large number of labelled data that represented a
small region of the input space, or suffered from high variance, made the classification procedure
harder, to the point where even the GCC system could not be useful. This problem is even worse when
working with small datasets such as IRIS or E. coli. For example, the IRIS dataset contains
150 patterns, which was divided into 75 training and 75 testing patterns. The labelled data
was then selected from the 75 training samples. First, 25 patterns were selected as labelled data,
and BP accuracy was close to 90%. Then, the size of the labelled data was reduced to 15 and 7.
The following maps show the distribution of 15 and 7 labelled patterns in the data space.
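The split described here can be sketched as follows (a hedged outline with synthetic stand-in data; the real experiment used the actual IRIS features and the trained SOM/BP pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the 150-pattern IRIS set: features X, labels y.
X = rng.normal(size=(150, 4))
y = np.repeat([0, 1, 2], 50)

# 75/75 train-test split, as in the experiment.
perm = rng.permutation(150)
train, test = perm[:75], perm[75:]

def make_labelled_subset(train_idx, n_labelled):
    """Randomly keep n_labelled training patterns as labelled data;
    the remainder become the unlabelled pool (their labels are discarded)."""
    chosen = rng.permutation(train_idx)
    return chosen[:n_labelled], chosen[n_labelled:]

# The experiment's three labelled-set sizes: 25, then 15, then 7 patterns.
for n in (25, 15, 7):
    lab, unlab = make_labelled_subset(train, n)
    print(n, len(lab), len(unlab))
```

Because the subset is drawn at random, two runs with the same `n_labelled` can cover the input space very differently, which is the sensitivity discussed below.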
Figure 5.4: SOM map for 15 IRIS labelled patterns

Figure 5.5: SOM map for 7 IRIS labelled patterns
Considering the fact that there is a difference of only 8 patterns between the two datasets,
BP had an accuracy of 85% for 15 patterns (Figure 5.4) and 67% for 7 patterns (Figure
5.5). After using the GCC system, BP accuracy increased to 92% (for 15 patterns) and
89% (for 7 patterns). These examples show how sensitive the selection procedure can
be when it comes to small-sized datasets. This sensitivity was magnified when using the
different variants of the GCC procedure (Neighbouring, Re-labelling, etc.) and considering
the number of classes, their distance relation and distribution. On the other hand, the
Mushroom dataset was large, which made the selection procedure much easier. In this
dataset, 90 labelled patterns were selected from 5393 training data. The only consideration
was to make sure that the 90 patterns covered both classes.
Another objective of this experiment was to examine GCC system performance
on different types of datasets. The results for each dataset (in the previous chapter) show
that the GCC system is capable of producing sufficient self-labelled data with reasonable
accuracy and, consequently, of improving classification for all the datasets. However, the range
of improvement for each type varies. For example, the improvement in classification in the
Breast Cancer and Mushroom cases is larger than in the E. coli or Diabetes case. In addition,
in the Breast Cancer and Mushroom cases, the size of the SOM map as well as the labelled
data can be reduced (as shown in Experiment 4).
As was previously mentioned, the self-labelling procedure does not always result in correctly
labelled data. Unlabelled patterns may be mis-classified due to overlaps in the SOM's nodes
or small interclass distances in the datasets. The proportion of mis-classified data (obtained
by the GCC system) varies in each dataset based on its statistical characteristics. In
all the experiments, GCC system performance was tested by training the BP network on
different portions of the self-labelled data. The E. coli set has been chosen as an example
to investigate the effect of mis-classified data on the classification procedure. Instead of using
random selection for different portions of self-labelled data, a specific number of mis-classified
patterns was selected with each portion. Figure 5.6 shows a sudden drop (from 70%
to 46%) in BP network performance after adding 25 mis-classified patterns to the
second portion of self-labelled data. Please note that, in each step, the number of correct
self-labelled data was increasing (although this is not shown on the graph). By increasing the
size of the correct self-labelled data, BP network performance improved (from 46% to 69%) in
the last step.
Figure 5.6: Number of mis-classified data versus BP network performance on the test data
5.2.1 Analysis of Experiment #4
This experiment was designed with the objective of resolving problems that were encountered
in experiment one. In the real world, problems related to the selection of labelled
data can be costly, since labelling is a costly procedure. It is important to select the most
effective unlabelled patterns, and the smallest number of them, to reduce the cost of active learning. In
order to achieve this goal, the SOM nodes have been used for the selection procedure. Three
strategies were designed for the selection, each one with different properties (results are
presented in the previous chapter).
The first strategy is the most powerful and effective of the three strategies and
provides a large number of labelled data. However, the resulting labelled set would be a
large, effectively infinite one (Castelli and Cover, 1995), which would violate the objective of this
experiment. Moreover, with a large labelled dataset, unlabelled data does not aid
in the reduction of classification error, and the use of procedures such as GCC would be out
of the question. With the use of the second and third strategies, the selected labelled data
can be reduced to a desirable degree. The second strategy is very effective and comparable
with the first strategy. Because of the randomness in the third strategy, it is not
as efficient as the other two. However, the number of selected patterns can be exactly
specified.
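A minimal sketch of node-based selection may clarify the idea (the exact strategies are defined earlier in the thesis; the nearest-pattern rule and the random node subsample below are only illustrative assumptions, not the thesis's actual strategies):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical trained SOM: a 4x4 grid of weight vectors in feature space.
weights = rng.normal(size=(16, 4))
X = rng.normal(size=(200, 4))  # pool of unlabelled patterns

def nearest_pattern(w, pool):
    """Index of the pool pattern closest to a node's weight vector."""
    return int(np.linalg.norm(pool - w, axis=1).argmin())

# One candidate per node: the pool pattern nearest each weight vector,
# so the candidates jointly cover the regions the map has learned.
candidates = sorted({nearest_pattern(w, X) for w in weights})

# Sampling a fixed number of candidates caps the labelling budget, which
# mirrors the third strategy's property that the count is exactly specifiable.
budget = 5
ask_expert = rng.choice(candidates, size=budget, replace=False)
print(len(candidates), len(ask_expert))
```

The trade-off in the text shows up directly: taking every candidate maximizes coverage but the labelling cost grows with the map size, while the random subsample gives an exact, small budget at the price of some efficiency.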
5.3 Analysis of Experiment #2
The objective of this experiment was to improve self-labelled data quality. The SOM net-
work was trained using training data (unlabelled and labelled data) instead of simply using
labelled data. Since unlabelled data provides idormation about the true distribution of
the data patterns (the IRIS data example in introduction section), the expectation was to
observe improvement in both seK-labelling and classification procedure for all the datasets-
Table 5.1 siimmarizes the results for the self-labelling procedure in the Alliance and Con-
servatiue methods. The first col-!imn for each method is the size of undefinecl patterns that
were clustered to undefined nodes. The second colwnn shows the number of mis-classified
patterns and the Iast one is the labelling accuracy. For al1 the datasets, labelling accu-
racy was increased with the Alliance method. In the E. coli, Diabetes and Head Diseuse
cases (Table 5.1), improvernents were produced by reducing mis-classsed data instead of
producing new self-labelled data. In the O ther dat asets, labelling accuracy was increased
due to improvement in quantity of self-labelled data (reducing the number of mis-classi6ed
data). This happens due to the information that the unlabelled data provides during the
procedure. (Please note that the results presented in Table 5.1 are based on Neighbouring
and Re-la belling techniques as well.)
Table 5.1: Self-labelling accuracy in the Alliance and Conservative methods. For each training dataset (IRIS, E. Coli, Breast Cancer, Mushroom, Heart Disease, Diabetes) and each method, the table lists the number of undefined patterns, the number of mis-classified patterns, and the labelling accuracy.
Despite this expectation, BP performance on the training data produced by the Alliance and
Conservative methods was very close for all but the Mushroom dataset. In
the Mushroom case, BP performance on the test data increases from 93% in the Conservative
method to 98% in the Alliance method. By comparing the number of mis-classified patterns
in the Conservative and Alliance methods (241 to 31) and the newly added self-labelled data
(23), 233 correct patterns were added as self-labelled data in the Alliance procedure. It
is important to note that the 241 mis-classified patterns (Conservative procedure) could have
misled the BP network. That amount of mis-classified data was later reduced to 31 in the
Alliance procedure. This example, once again, validates the assumption of finite labelled
data and infinite unlabelled data and its effect on the reduction of classification error. The
Mushroom set was divided into 90 labelled and 5303 unlabelled data. It is evident that
5303 unlabelled patterns can provide sufficient information to improve labelling accuracy.
In general, it can be said that the Alliance method improves the self-labelling procedure
and should be used in the GCC procedure.
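The difference between the two methods comes down to which data the SOM sees during training. The sketch below is a hedged illustration (a minimal 1-D-grid SOM with made-up parameters and data, not the thesis's actual network): a map trained on labelled data alone never covers regions represented only by unlabelled patterns, while a map trained on both does:

```python
import numpy as np

rng = np.random.default_rng(7)

def train_som(data, n_nodes=10, epochs=30, lr=0.5, radius=2.0):
    """Minimal 1-D SOM: nodes on a line, Gaussian neighbourhood,
    learning rate and radius decaying over the epochs."""
    w = data[rng.choice(len(data), n_nodes)].astype(float)
    grid = np.arange(n_nodes)
    for e in range(epochs):
        a, r = lr * (1 - e / epochs), radius * (1 - e / epochs) + 0.5
        for x in data[rng.permutation(len(data))]:
            bmu = np.linalg.norm(w - x, axis=1).argmin()
            h = np.exp(-((grid - bmu) ** 2) / (2 * r * r))
            w += a * h[:, None] * (x - w)
    return w

# Hypothetical data: labelled points cover only the first cluster,
# while a second cluster exists only in the unlabelled pool.
labelled = rng.normal([0, 0], 0.4, (10, 2))
unlabelled = np.vstack([rng.normal([0, 0], 0.4, (90, 2)),
                        rng.normal([4, 4], 0.4, (100, 2))])

w_conservative = train_som(labelled)                       # labelled only
w_alliance = train_som(np.vstack([labelled, unlabelled]))  # both

# Only the Alliance-style map places nodes near the unlabelled-only cluster.
near2 = lambda w: int((np.linalg.norm(w - [4, 4], axis=1) < 1).sum())
print(near2(w_conservative), near2(w_alliance))
```

Nodes that never land near the second cluster can only produce undefined or wrongly labelled patterns there, which is consistent with the Mushroom result above.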
5.4 Analysis of Experiment #3
In addition to the Alliance and Conservative methods, the Neighbouring and Re-labelling techniques
were other variants of the GCC procedure that were used to increase the self-labelling
ability of the system. Empirical results for this experiment illustrated that the combination
of the Alliance, Neighbouring and Re-labelling techniques could improve self-labelled data
to a sufficient degree (previous chapter). However, each technique's performance varies depending
on the statistical characteristics or the size of the dataset. The self-labelled data
resulting from the Neighbouring method is highly dependent on the intraclass and interclass
distances. If the dataset's interclass distances are not sufficiently large, the self-labelling
process will not be accurate and may result in a large number of mis-classified self-labelled
patterns. Mis-classified data may later mislead second-order labelling and the classification
procedure (BP network). In Table 5.1, the results in the mis-classified columns demonstrate
that the datasets with small interclass distances (E. coli, Diabetes and Heart
Disease) have larger numbers of mis-classified data than the other ones (considering the size
of the dataset). By investigating the maps for each of these datasets (4.1, 4.4 and 4.5),
it can be said that the role of the initial labelled data in controlling self-labelling accuracy in
the Neighbouring technique is very crucial. Large SOM maps can result in a lower number
of mis-classified data and a larger number of undefined data, which may reduce the risk
of misleading the classification procedure. Despite the discussed problems, the results for this
experiment show that the Neighbouring method is capable of improving self-labelled data
quality and quantity. It is important to note that all the selections (labelled and unlabelled
data) during this experiment were random, without using prior knowledge. This fact
confirms the positive effect of the Neighbouring method on the self-labelling procedure.
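As a rough sketch of this behaviour (the grid, the labels and the unanimity rule below are hypothetical simplifications, not the thesis's actual Neighbouring and Re-labelling procedures), labels spread outward from the initially labelled nodes, and nodes caught between conflicting fronts stay undefined:

```python
import numpy as np

# Hypothetical 5x5 SOM grid: 0 = undefined node, 1/2 = class labels
# assigned from the initial labelled data.
grid = np.zeros((5, 5), dtype=int)
grid[0, 0] = grid[1, 1] = 1
grid[4, 4] = grid[3, 4] = 2

def neighbour_pass(g):
    """One Neighbouring-style pass: an undefined node adopts a label only
    if all of its labelled 4-neighbours agree (conflicts stay undefined)."""
    out = g.copy()
    rows, cols = g.shape
    for i in range(rows):
        for j in range(cols):
            if g[i, j] != 0:
                continue
            labs = {g[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                    if 0 <= a < rows and 0 <= b < cols and g[a, b] != 0}
            if len(labs) == 1:
                out[i, j] = labs.pop()
    return out

# Re-labelling as repetition: apply the pass until no node changes.
prev = grid
while True:
    nxt = neighbour_pass(prev)
    if (nxt == prev).all():
        break
    prev = nxt
print((prev == 0).sum())  # undefined nodes remaining at the conflicting front
```

With small interclass distances the two fronts meet early and the conflicting band is wide, which is the mechanism behind the larger mis-classified counts reported for E. coli, Diabetes and Heart Disease.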
Re-labelling is a very useful method; however, one might wonder how many times this
method should be applied during one procedure. At this time, it is not possible to provide
theoretical evidence on the number of Re-labelling passes that should be taken for self-labelling.
Of course, Re-labelling can be extended to arbitrary depths, although typically
one will reach a point of diminishing returns where no labels can be assigned to the undefined
data (unlabelled data that have not been assigned labels).
The ideal situation for the Neighbouring and Re-labelling methods is a dataset with
large interclass and small intraclass distances. In this situation, each round of the labelling
procedure will result in a large number of correct self-labelled data. Then, in the Re-labelling
procedure, these data patterns and their neighbouring nodes will be used to obtain even
more correct self-labelled data. If this ideal condition on the class distances is not met,
then the number of mis-classified patterns produced will be large. Analyzing the GCC
system performance step by step will result in constructing an error model that can be used
as evidence to terminate the labelling procedure. One of the important assumptions in the
GCC procedure is that the probability of error in the original labelled data is zero. (As an
important side note, this assumption does not have to be correct.) By proceeding through
the Alliance and Conservative methods, some of the unlabelled patterns are mis-classified
and are added to the original labelled data to be used again. Other than these two strategies,
the Neighbouring technique can result in mis-classified patterns, especially in datasets such
as E. coli, Diabetes, or Heart Disease, where classes are scattered all over the data space
with no specific order and close to each other. Consequently, these newly added labelled
data (with many mis-classified items) along with the original labelled data will be used for
the Re-labelling procedure. Starting again from the first stage (with the difference of having a
large number of incorrect data patterns in the labelled data) can result in even more mis-classified
data. Therefore, over-application of the self-labelling procedure may jeopardize
the accuracy of the labelled data and has to be carefully considered. When the number of
undefined data patterns remains unchanged, the Re-labelling procedure has to be halted.
As an example, E. coli was selected to investigate the over-labelling problem. Two
different labelled datasets were used for the training and self-labelling process. One was the same
labelled data (32 patterns, 16 for each class (refer to the previous chapter)) used in previous
experiments, and the other was 32 randomly selected samples, 8 from class O157 and
24 from class non-O157. For each dataset, the Re-labelling procedure was applied 8 times.
Figure 5.7 demonstrates the labelling accuracy of these datasets, where

Labelling accuracy = (number of correct self-labelled data) / (number of self-labelled data obtained).

For a balanced number of classes, the labelling accuracy remained essentially unchanged (with
minor fluctuations); however, for an unbalanced number of classes, the labelling accuracy decreased
drastically. This happened because of the increase in the number of mis-classified
data.
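The accuracy measure above is straightforward to compute; a small hedged helper makes the definition concrete (the function name and the None-for-still-unlabelled convention are assumptions of this sketch, not the thesis's implementation):

```python
def labelling_accuracy(self_labels, true_labels):
    """Labelling accuracy = correct self-labelled / total self-labelled,
    counting only patterns that actually received a label (not None)."""
    scored = [(s, t) for s, t in zip(self_labels, true_labels) if s is not None]
    if not scored:
        return 0.0
    return sum(s == t for s, t in scored) / len(scored)

# Five patterns, one still unlabelled; three of the four assigned labels
# are correct, so the accuracy is 3/4.
print(labelling_accuracy([0, 1, None, 1, 0], [0, 1, 1, 0, 0]))  # 0.75
```

Note that patterns left undefined do not count against the accuracy; they only reduce the quantity of self-labelled data.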
Figure 5.7: Test on E. coli: re-labelling accuracy versus the number of re-labelling passes, for balanced and unbalanced class distributions in the labelled data
5.5 Conclusion
Ideally, the effort is to set specific conditions and to generalize system performance based
on those conditions. Generalization based on empirical results may not be applicable or
accepted without theoretical proof. However, it is difficult to assess the quality
of the approach based on theoretical arguments alone. This study does not concentrate on
the theoretical aspects of the approach or experiments, because of time limitations. It is basically
an introduction to a new technique for the use of unlabelled data in classification. More
detailed study is needed to establish a theoretical basis for the current approach and its
performance.
In this chapter, some of the questions related to the GCC system and its variations were
examined with an example. More detailed study still needs to be done. The major problems
during the experiments were those related to the randomly selected labelled datasets
(their distribution) and the size of the labelled versus unlabelled datasets. The good
news is that the majority of real-world problems deal with a large number of unlabelled
data. In addition, the number of classes, their distributions, and other prior knowledge
about the datasets are often available. With the use of Experiment 4 (to select labelled
data) and the large number of unlabelled data available, the GCC system seems to be a
promising approach.
Chapter 6
Conclusions
The well known problem of insufficient labelled data is still an open question in any field
of supervised learning where processes are faced with unsatisfactory performance and the high
cost of labelling training patterns. Various attempts have been made to overcome this
problem. Much of the research, such as co-training (Blum and Mitchell, 1998), or the
combination of Expectation Maximization with other classifiers or hierarchical shrinkage
(McCallum and Nigam, 1999), focuses on the use of unlabelled data to improve classification
in the text problem domain. This thesis has presented a simple, yet novel technique (Guelph
Cluster Class (GCC)) which shares some similarities with other approaches to the use of
unlabelled data. However, GCC is based on neural networks and has been tested on sets of
real-world problems other than text problems.
This work is concerned with a relevant neural classification problem (introduced in
Chapter 2), often encountered in practice, where the amount of labelled data is insufficient for
training, validating and testing a neural network. The GCC algorithm has been described
along with its implementation and practical applications. The novelty of this approach lies
in applying a Self-Organizing Map for clustering and using the resulting clusters to
assign labels to unlabelled data (the self-labelling process). The ability of the SOM to provide
information about the structure of the labelled dataset was tested using the Alliance and
Conservative methods. The SOM has the tendency to cluster similar input vectors in the
same topological neighbourhood, close to each other. This thesis has shown how to apply the
neighbouring nodes in the GCC system (the Neighbouring approach) to produce self-labelled
data in sufficient quantities and with sufficient accuracy. In addition, the GCC system used
the resulting self-labelled data to assign labels to unlabelled data (the Re-labelling method) and
obtain more labelled data. In the classification stage, the self-labelled data was applied to
enhance the training of a BP network. Comparison of the results was based on BP network
performance on test data using the original labelled data for training. To evaluate the
GCC system, the selected datasets (six real-world benchmark datasets) were statistically
different. The results of these experiments (1, 2 and 3) showed that the GCC approach has
significant merit for integrating unlabelled data into the domain of supervised learning.
In the last experiment, the clusters produced by the SOM were used to help select the
most effective samples of unlabelled data, which should be labelled by an expert. Since the
labelling process (of the original labelled data) is a costly exercise, information from the
clustering stage could direct attention to those unlabelled samples which would be most useful
in the self-labelling process if their true labels were known. In this last experiment, the
original labelled data was reduced to a significantly smaller size than in the previous experiments,
although GCC performance remained the same for most of the datasets.
Similar to other approaches, there are some potential caveats in the use of the GCC
approach. Because of the random nature of the experiments and its sensitivity to the initial
data, it is hard to predict GCC performance. It is not exactly clear how to control the
number of mis-classified data and how they would affect the classification procedure. However,
the GCC system has been shown to be a reliable technique for improving supervised learning.
It can work with a very small set of labelled data (properly selected) and thus reduce
cost. GCC is a flexible technique that can be used on different datasets. It is an easy
technique for integrating unlabelled data, compared to other approaches in which some
essential assumptions have to be made which may not always hold (e.g. (Blum and
Mitchell, 1998) and (McCallum and Nigam, 1999)).
6.1 Future Work
The studies described in this thesis layout the foundation for future researches. Further
more, they help open doors into new applicationst Yet, more detailed researches need to be
done in order to provide more insights into the 1:echnique.
Indeed, it is difficult to assess the quality of the introduced approach based on theoret-
i d arguments. However, some more theoretical background and more solid argumen-
tation would make the work stronger. This effort will be used in the selection of GCC
system features such as the SOM and labelling process (parameters and methods).
A more detailed investigation of statistical c.haracteristics of datasets would make the
seiection of GCC system parameters and methods fsuch as: Allzaace or Conservatiue,
Neighbouring, and re-labelling) much easier. It may answer some of the questions
(caused by the random nature of this study) and eliminate unpredictability in the
syst em's performance.
Usage of other clustering technique with similar algorithm can result in interesthg
outcornes. Other clustering technique may outperform the SOM self-labelling process
on some of the datasets.
Involving fuzzy logic theory in the labelling process may reduce the number of mis-
classified data. It is important to have control over mis-classified patterns, since they
rnay mis-lead the re-lcrbelling process and as a result the classification procedure.
Exploring other techniques and dgorithirw for labelled data selection (active leaming)
to achieve: i) the smallest number of labelled data (to reduce the cost), ii) the most
effective labelled data (to increase self-labelled data quantity and accuracy).
Investigating GCC system's performance on the datasets with more than three classes
that are nonlinearly separable.
Appendix A
Experimental Results
A.l Experiments 1, 2 & 3: Labelling Procedure
Figure A.1: Labelling Accuracy, IRIS dataset

Figure A.2: Labelling Ability, IRIS dataset
Figure A.3: Labelling Accuracy, E. coli dataset

Figure A.4: Labelling Ability, E. coli dataset

Figure A.5: Labelling Accuracy, Breast Cancer dataset

Figure A.6: Labelling Ability, Breast Cancer dataset

Figure A.7: Labelling Accuracy, Mushroom dataset

Figure A.8: Labelling Ability, Mushroom dataset
Figure A.9: Labelling Accuracy, Diabetes dataset

Figure A.10: Labelling Ability, Diabetes dataset

Figure A.11: Labelling Accuracy, Heart Disease dataset

Figure A.12: Labelling Ability, Heart Disease dataset
A.2 Experiments 1, 2 & 3: Classification stage (Results on
test data)
95 1 I I 1 1 I
Allinace - Com-üve -------
90 -
85 - 7' a? 81 .-.--
80 - / :.- A --------- ---------
O 't __--- a
75 - .-
73 -
LabelIed+Setf-labeiled data
Figure A. 13: BP performance on the IRIS dataset
l I I 1 1
Consetvative - Alliance
74 73
Labelled+Self-labelled data
Figure A.14: BP performance on the E. colé dataset
LabelledtSeM-labelled data
Figure A.15: BP performance on the Breast Cancer dataset
Figure A.16: BP performance on the Mushroom dataset
Figure A.17: BP performance on the Diabetes dataset
Figure A.18: BP performance on the Heart Disease dataset
A.3 Experiment 4: Classification stage (Results on test data)
Figure A.19: BP performance on selected IRIS labelled data
Figure A.20: BP performance on selected E. coli labelled data
Figure A.21: BP performance on selected Breast Cancer labelled data
Figure A.22: BP performance on selected Mushroom labelled data
Figure A.23: BP performance on selected Diabetes labelled data
Figure A.24: BP performance on selected Heart Disease labelled data
Bibliography
Baum, E.B., and Haussler, D., What Size Net Gives Valid Generalization?. Neural Computation, 1, 151-160, 1989.
Blum, A., and Mitchell, T., Combining Labeled and Unlabeled Data with Co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.
Castelli, V., and Cover, T.M., On the Exponential Value of Labeled Samples. Pattern Recognition Letters, 16, 105-111, 1995.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., and Froelicher, V., International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310, 1989.
Fisher, R., The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 2, 179-188, 1936.
Haykin, S., Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, New Jersey, 1999.
Iba, W., Wogulis, J., and Langley, P., Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79, Ann Arbor, Michigan: Morgan Kaufmann.
Joutsiniemi, S.-L., Kaski, S., and Larsen, T.A., Self-organizing map in recognition of topographic patterns of EEG spectra. IEEE Transactions on Biomedical Engineering, 42, 1062-1068, 1995.
Kaski, S., and Kohonen, T., Exploratory data analysis by the Self-organizing map: Structures of welfare and poverty in the world. Neural Networks in Financial Engineering. Proceedings of the Third International Conference on Neural Networks, 498-507, 1996.
Kaski, S., and Lagus, K., Comparing self-organizing maps. Proceedings of International Conference on Artificial Neural Networks, 1112, 809-814, 1996.
Kohonen, T., Kaski, S., Lagus, K., and Honkela, T., Very large two-level SOM for the browsing of newsgroups. Proceedings of International Conference on Artificial Neural Networks, 1112, 269-274, 1996.
Kohonen, T., Self-Organizing Maps. Springer, 1997.
Muller, K., Finke, M., Schulten, K., Murata, N., and Amari, S., A Numerical Study on Learning Curves in Stochastic Multi-Layer Feed-Forward Networks. Neural Computation, 8, 1085-1106, 1996.
Nigam, K., and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training. Ninth International Conference on Information and Knowledge Management, 2000.
Nigam, K., McCallum, A., Thrun, S., and Mitchell, T., Learning to Classify Text from Labeled and Unlabeled Documents. American Association for Artificial Intelligence, 1998.
McCallum, A., and Nigam, K., Text Classification by Bootstrapping with Keywords, EM and Shrinkage. ACL '99 Workshop for Unsupervised Learning in Natural Language Processing, 1999.
Schuurmans, D., A New Metric-Based Approach to Model Selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI, July 1997.
Shahshahani, B.M., and Landgrebe, D.A., The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 5, 1994.
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., and Johannes, R.S., Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Symposium on Computer Applications and Medical Care, 261-265, 1988, IEEE Computer Society Press.
Stacey, D.A., Preliminary Artificial Neural Network Analysis of E. coli Data. Unpublished Report, Nov, 1998.
de Sa, V.R., Learning Classification with Unlabeled Data. Advances in Neural Information Processing Systems, 6, 112-119, 1994.
Zhang, J., Selecting typical instances in instance-based learning. Proceedings of the Ninth International Machine Learning Conference, 470-479, 1992, Aberdeen, Scotland: Morgan Kaufmann.