Visual Description of Skin Lesions

Matteo Zanotto

Master of Science

Artificial Intelligence

School of Informatics

University of Edinburgh

2010

Abstract

This dissertation aims at a better understanding of the way people evaluate the visual
similarity of skin lesions. Experiments testing the evaluation performance achieved by
following the ABCD rule were run first. The results showed a substantial variability in
the obtained evaluations, which calls the usefulness of this qualitative guideline into
question. According to additional analysis, the use of the ABCD rule in the development
of automatic classifiers can arguably be discouraged. Experiments based purely on visual
similarity, on the other hand, showed the emergence of homogeneous visual classes of
Basal Cell Carcinomas. These classes delineate some visual criteria possibly followed by
the observers during the assessment. A system was developed to learn these criteria from
the experimental data, and promising results are reported despite the limited
availability of training and testing data.

Acknowledgements

First of all I want to thank my supervisor, Dr. Lucia Ballerini, for her friendly help and

constant guidance. A warm thank you goes to Prof. Fisher for the numerous sugges-

tions he gave throughout the project. The long discussions and the precious comments

provided by Dr. Aldridge and Prof. Rees from the Department of Dermatology were

extremely helpful and I want to thank them for sharing their non-informatics point of

view on several topics.

Finally, I want to thank my family for their constant support, my friends back at home

who took part in rather disgusting surveys to grant me some data to work on, and those

here in Edinburgh with whom I spent long hours in the lab over the last 12 months.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Matteo Zanotto)

Try Again. Fail Again. Fail Better.
S. Beckett

Contents

1 Introduction
  1.1 Challenges of Skin Lesion Assessment
  1.2 Aims
  1.3 Novelty
  1.4 Overview

2 Background
  2.1 Introduction to Skin Lesions
    2.1.1 Types of Skin Lesions
    2.1.2 The ABCD Rule
  2.2 Literature Review

3 User Performance with ABCD rule
  3.1 Experimental Set-up
  3.2 Results
    3.2.1 Impact of Visual Anchors
    3.2.2 Correlation of Different Properties
  3.3 Discussion

4 Visual Similarity of Skin Lesions
  4.1 Data Collection
  4.2 Definition of Visual Classes
    4.2.1 Multi-Dimensional Scaling
    4.2.2 Spectral Clustering
  4.3 Discussion

5 Automatic Classification System
  5.1 Structure of the System
    5.1.1 Feature Extraction Stage
    5.1.2 Classification Stage
  5.2 Experiments and Evaluation
  5.3 Discussion

6 Conclusions

Bibliography

List of Figures

2.1 Pictures of six classes of skin lesions.
2.2 Pictures of three different Basal Cell Carcinomas.

3.1 Web interface used for collecting data during the first experiment.
3.2 Diagram of main database tables for experiment 1.
3.3 Example of derived anchor points
3.4 Impact of visual anchors on lesion D414b
3.5 Matrix of scatter-plots showing the correlation patterns between the different evaluated properties.
3.6 Results of correlation analysis for each of the four clinical classes of lesions.

4.1 Basal Cell Carcinomas showing substantial differences in shape.
4.2 Web interface developed to collect similarity assessment data.
4.3 Diagram of database tables for experiment 2.
4.4 Results of Multi-Dimensional Scaling with a 1D output layout
4.5 Results of Multi-Dimensional Scaling with a 2D output layout.
4.6 Hierarchical structure of the derived clusters.
4.7 Results of Spectral Clustering on Sample 1
4.8 Results of Spectral Clustering on Sample 2
4.9 Results of Spectral Clustering on the Complete Dataset
4.10 Results of Spectral Clustering mapped on MDS plot

5.1 Features obtained through topographic Independent Component Analysis.
5.2 Wrong assignments to the flat BCC class (upper row) and low-confidence correct assignments to the non-flat BCC class (lower row).

List of Tables

3.1 Brown–Forsythe tests on changes in the variance after including visual anchors
3.2 Statistics of the scores showing a statistically significant change after the inclusion of the visual anchors
3.3 Mann–Whitney tests on changes of the average variance after including visual anchors

5.1 Results obtained on the image test set. True Class represents the visual class the lesion belongs to, P(flat) and P(non-flat) the probabilities assigned to the two classes by the Gaussian process

Chapter 1

Introduction

Interest in the field of Computer-Aided Diagnosis (CAD) has been growing rapidly over
the last few years, and CAD tools are expected to gain importance and wider application
in the future. The reason behind this quick increase in the interest shown by the
medical community lies in the potential that automatic classification techniques have
for the analysis of medical images. While object recognition technology is not yet
mature enough to delegate the diagnostic process completely to a computer-based system,
in a few areas CAD tools are already developed enough to be of practical use. In
particular, they can play a very important role in the education of new clinicians, who
can train with a “second opinion” provided by the system, or as a support to
non-specialised clinicians in their decision to direct patients to a specialist. One of
the disciplines in which CAD systems are increasingly popular is dermatology. There are
several reasons behind this, including the sharp increase in skin cancer cases reported
by several studies. The most interesting peculiarity of dermatology, though, is that
images of the skin can be acquired with standard equipment, such as digital cameras,
even without the presence of a doctor. While other kinds of medical images, such as MRI
scans, can only be obtained within a hospital where specialised personnel are available
for diagnosis, skin images can be obtained easily by laypeople even without
supervision. This unique characteristic enables CAD systems to be used in dermatology
even as self-screening tools, especially in rural areas where access to specialists is
still not guaranteed. Moreover, they can be used by general practitioners, without the
need for particular investments, to better decide whether to direct the patient to a
dermatologist or not.

Based on these observations, several papers have recently been published presenting
automatic systems for skin lesion classification. Most of the work done up to now (see
section 2.2 for a review of the relevant literature) has concentrated on the detection
of melanoma, which is the most dangerous but also the rarest lesion. Conversely, very
little attention has been paid to other kinds of skin lesions which, despite generally
being non-life-threatening, still require treatment before they cause complications.
Although this specialisation on melanoma is generally overlooked, the ability to
classify all the potentially dangerous types of lesion is paramount to obtain systems
which can have a real impact on healthcare.

In order to bridge this gap, a lot of research effort has been dedicated at this
University to the classification of non-melanoma skin lesions, and this dissertation is
meant to contribute to it by exploring some new research directions.

1.1 Challenges of Skin Lesion Assessment

Even though skin lesion classification might at first seem very similar to any other
object recognition task, sharp differences emerge when closer attention is paid to its
specific characteristics. At the highest level, these differences can be divided into
two major classes. The first class concerns the specificities which make the assessment
difficult for humans, while the second class covers those aspects which challenge
automatic classifiers. Obviously the boundary between the two classes is not always
well defined, but this subdivision provides a good general approximation.

The main problems faced by humans as they assess skin lesions are due to the guidelines
currently in use in the dermatological community. Regardless of whether they are
addressed to clinicians or laypeople, these guidelines generally rely upon descriptions
based on concepts which are assumed to be universally valid as a sort of implicit
standard. Examples are concepts like light vs. dark, regular vs. irregular, symmetric
vs. asymmetric. Even though their general meaning is clear to anyone, evaluating the
degree of one of them, say asymmetry, can give rather subjective results.

A second major issue for people lies in the fact that, because of the structure of the
visual system, our perception is context dependent. The same shade of grey, for
example, is perceived as lighter or darker according to the colour pattern of its
surrounding area. The same phenomenon happens for skin lesions, and has many
implications for the assessment phase.

Even limiting the attention to these two aspects without getting into more subtle
details, we observe that evaluation can depend not only on the subjectivity of each
individual, but also on the context which, in the case of skin lesions, consists of the
appearance of the surrounding normal skin. This situation can be seen as a problem of
finding a standard representation of specific concepts, a description which, regardless
of any subjectivity and variation of context, still uniquely identifies the relevant
details.

When considering automatic classification, on the other hand, the opposite problem is
faced, as it is still difficult to equip computer systems with enough abstraction and
generalisation capabilities to avoid pitfalls. Coupling this with the extreme
variability in colour and texture that normal skin can exhibit, it is not surprising
that automatic systems often provide a segmentation of the image into region of
interest (lesion) and background (surrounding skin) which is far from ideal.

Another limitation of computer systems is the difficulty in dealing with high-level
features which are easily managed by humans. The presence of a blood vessel on a patch
of skin, for example, can be effortlessly recognised by an observer, but due to the
variability in its configuration and the presence of “clutter” (e.g. visible skin
pattern or hair) it is not always easily detected by computer vision systems.

Both of these classes of difficulties must be considered when designing an automatic
system for skin lesion classification, and each of them has been investigated in the
work of this dissertation.

1.2 Aims

The two main hypotheses underlying the project are that qualitative guidelines like the

ABCD rule (see section 2.1.2 for details) do not reflect properly the knowledge used

by dermatologists in the diagnostic process and that a new approach to classification

can be derived from results obtained in experiments based on visual similarity.

Empirical observations show that people with no medical training and no specific der-

matological knowledge are capable of grouping images of skin lesions in coherent

classes and subclasses. This evidence suggests that some intrinsic visual characteris-

tics can act as guidance in the classification task. If such features exist, image analysis

and machine learning techniques can be used in an attempt to extract them from the

visual classes created by humans, and ultimately to develop an automatic classification

system relying upon them.

As mentioned before, implementing a system to automatically classify skin lesions is
not trivial, and many specificities regarding their appearance need to be taken into
consideration in order to succeed. The aim of this thesis is to investigate, more
accurately than has been done before, the two classes of challenges introduced in the
previous section. In particular, the first part of the dissertation will concentrate on
providing a better understanding of how well people can follow qualitative guidelines
like the ABCD rule. This will be done by analysing the data of an experiment simulating
the self-screening procedure people are encouraged to perform on a regular basis to
detect melanomas in their early stages. Shedding light on this is expected to give a
better understanding of whether automatic classifiers should be based on such
guidelines or not. The second part, on the other hand, will focus more on how to design
systems capable of dealing effectively with the second class of challenges, those
linked to the limited abstraction capability of automatic classifiers. To do so, an
experiment has been set up to obtain groups of skin-lesion images judged by humans to
be visually similar. The resulting data have then been used in an attempt to develop an
automatic system capable of replicating these visual classes. The aim of this second
stage was to test new design strategies which could be effective in overcoming the
limitations often shown by classifiers proposed for application in dermatology.

1.3 Novelty

Several aspects of this dissertation differ, at least to some degree, from the work
which has previously been proposed in the field. People’s ability to assess skin
lesions using the ABCD rule has been evaluated before, but in a different way and never
with the perspective of testing whether this rule could prove useful in the development
of automated classifiers. Secondly, to the knowledge of the author, the whole process
used to design the proposed system, starting from humans’ perception and trying to
replicate their ability to evaluate visual similarity, has never been used before.
Finally, no work on visually recognising sub-classes of Basal Cell Carcinoma has
previously been proposed.

1.4 Overview

The dissertation is organised as follows. Chapter 2 gives an introduction to the der-

matological concepts needed to understand the following chapters. Additionally, it

provides a review of the literature related to the classification systems specifically de-

signed to work on skin-lesion images. Chapter 3 presents the outcomes of the exper-

iment performed to evaluate people’s ability to use the ABCD rule. The part more

closely related to the development of the automatic classifier starts with Chapter 4,


presenting both the experiment used to gather data regarding visual similarity and the

obtained visual classes, and continues with Chapter 5 where a detailed description of

the developed system is given along with a discussion regarding its performance. Fi-

nally Chapter 6 draws conclusions and proposes future directions of research.

Chapter 2

Background

In this chapter an overview of the different topics needed to understand the rest of the

dissertation will be provided. In particular, a quick introduction to skin lesions is given

in section 2.1.1, while the ABCD rule for melanoma detection is presented in section

2.1.2. A detailed coverage of the topic is beyond the scope of this work and only the

facts directly useful to understand the framework in which the research was performed

will be reported. A literature review covering the work already done in the field of

automatic classification of skin lesions is presented in section 2.2.

2.1 Introduction to Skin Lesions

This section summarises briefly the dermatological concepts which will be extensively

used in the following chapters. It includes an overview of the different types of skin

lesions and an introduction to the ABCD rule used for melanoma screening.

2.1.1 Types of Skin Lesions

The term skin lesion is fairly general and is used to refer to a variety of phenomena.

Roughly speaking, a skin lesion is any kind of skin patch which presents different

characteristics when compared to its surrounding area. Examples of some types of

skin lesions can be found in Figure 2.1 showing six classes of major interest:

• Seborrhoeic Keratosis (SK)

• Melanocytic Nevus

• Actinic Keratosis (AK)

• Basal Cell Carcinoma (BCC)

• Squamous Cell Carcinoma (SCC)

• Melanoma

Figure 2.1: Pictures of six classes of skin lesions. Panels: (a) Seborrhoeic Keratosis, (b) Melanocytic Nevus, (c) Actinic Keratosis, (d) Basal Cell Carcinoma, (e) Squamous Cell Carcinoma, (f) Melanoma.

SKs and Melanocytic Nevi are benign forms of skin lesions, AKs are considered a
pre-malignant condition, while BCCs, SCCs and Melanomas are malignant forms of skin
lesions. Among the last three, Melanoma is the most dangerous, causing the majority of
skin-disease related deaths despite being one of the least common cutaneous cancers.
BCCs and SCCs are less dangerous than Melanoma, but are still considered malignant
lesions. They rarely metastasise, especially BCCs, but they both need treatment because
of their tendency to expand to nearby tissues. Despite growing more slowly than SCCs,
BCCs are highly destructive and, if not treated in their early stages, can cause
significant damage, possibly extending beyond the skin of the patient.

An important peculiarity of some of these classes is the extreme variability in appear-

ance they can present. As an example, Figure 2.2 shows different images of Basal Cell

Carcinomas.

Figure 2.2: Pictures of three different Basal Cell Carcinomas, panels (a) to (c).

As can be seen in Figure 2.1 and Figure 2.2, skin lesions present some peculiar
characteristics. At first glance it appears clear that the difference between the types
cannot simply be described by a single attribute. In other words, it is not possible to
visually discriminate between classes purely on the basis of one property such as
colour or shape. This is due to several reasons, including the aforementioned
variability shown within each class, the specificity of each individual’s skin and,
ultimately, the difficulty faced when trying to define the border between the lesion
and the surrounding normal skin, especially when dealing with non-pigmented (such as
most of the BCCs) or pale skin lesions.

This lack of a single distinctive marker led, over the years, to the formulation of
several multi-criteria guidelines intended to be used both by clinicians during the
diagnostic process and by laypeople for self-screening. Two important examples are the
ABCD rule [23] and the 7-point check-list [7]. Only the ABCD rule will be briefly
presented, as it is relevant for this work; the interested reader is pointed to the
cited paper for a presentation of the 7-point check-list.

2.1.2 The ABCD Rule

The ABCD rule was proposed in 1985 by Friedman et al. [23] as a guideline for both
clinicians and laypeople to visually recognise potential melanomas in the early stages
of development. Specifically, the rule suggests evaluating four properties of the
lesion to verify whether it is potentially dangerous:

• Asymmetry as melanomas tend to be asymmetric both in shape and in terms of colour distribution

• Border irregularity as melanomas have less defined and more jagged borders than benign lesions

• Colour variegation as melanomas tend to have a non-uniform colour distribution

• Diameter as melanomas tend to be wider than 6 mm
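
As an illustration only, the four criteria can be recorded as a simple checklist. The sketch below is not part of the original rule or of this work: the class name `ABCDObservation`, its fields and the idea of listing the flagged letters are assumptions made for the example. Only the 6 mm diameter threshold comes from the list above, and the three qualitative judgements are assumed to be supplied by a human observer.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from the Diameter criterion; everything else is hypothetical.
DIAMETER_THRESHOLD_MM = 6.0


@dataclass
class ABCDObservation:
    """One observer's qualitative judgement of a single lesion (illustrative)."""
    asymmetric: bool
    irregular_border: bool
    variegated_colour: bool
    diameter_mm: float

    def flagged_criteria(self) -> List[str]:
        """Return the letters of the criteria this observation flags."""
        flags = []
        if self.asymmetric:
            flags.append("A")
        if self.irregular_border:
            flags.append("B")
        if self.variegated_colour:
            flags.append("C")
        if self.diameter_mm > DIAMETER_THRESHOLD_MM:
            flags.append("D")
        return flags


observation = ABCDObservation(
    asymmetric=True,
    irregular_border=False,
    variegated_colour=True,
    diameter_mm=7.5,
)
print(observation.flagged_criteria())
```

Reducing the first three criteria to booleans hides exactly the graded, subjective judgement whose variability is studied in chapter 3; the structure is meant only to make the four properties explicit.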


Major stress is put on the fact that the sooner melanomas are identified, the higher
the probability of effective surgical removal, which translates into a very high
survival rate. For this reason, people are encouraged to actively examine their skin
following the ABCD rule in search of suspicious signs which might suggest the
development of melanoma.

Over the years studies have been conducted on the effectiveness of the ABCD rule, such
as Brandstrom et al. [13], Gunasti et al. [24], Meyer et al. [37] and Reetz Muller et
al. [44]. While some papers [13, 44] claim that the use of the ABCD rule had a positive
impact on the answers given by the participants, others [24, 37] point out a
substantial variability in the way different people assess some of the criteria. The
results obtained by Laskaris in his Master’s thesis [27] also support the claim that
people evaluate the same skin lesions in different ways.

Given the importance of a correct evaluation of the four key properties in order to
obtain successful results with the ABCD rule, the findings presented in [24, 27, 37]
highlight the necessity for a more extensive and closer study of people’s assessment
performance. An attempt to gain a better understanding of this evaluation variability
was part of this dissertation and will be presented in chapter 3.

2.2 Literature Review

Over the past ten years many research papers applying machine learning and com-

puter vision techniques to dermatology have been proposed. This tendency reflects the

growing importance of Computer-Aided Diagnosis especially in those disciplines, like

dermatology, where images of the patients can be easily obtained, often with readily

available equipment.

From the methodological point of view a distinction must be made regarding the type of
images used in the proposed systems. Two main categories can be found: papers working
on normal camera photographs and papers using dermatoscopic images. Dermatoscopic
images are obtained through a dermatoscope, which consists of a magnification device
(typically 10x) equipped with a source of light and engineered to avoid skin
specularities, generally through the use of a polariser. Despite the difference in the
acquisition phase, which introduces a difference in the level of available detail, the
image-analysis algorithms applied in the two cases are equivalent and hence the
techniques will be presented together without any particular distinction.

Most of the techniques proposed in the literature focus on the recognition of melanoma
(see [33, 42] for reviews), while only a few attempts to classify other kinds of lesion
have been made. Due to the considerable difference between the work done in this
dissertation and previous research, it will be impossible to evaluate the obtained
results through comparison with benchmarks. For this reason, this literature review is
mainly aimed at giving a flavour of the techniques previously used in the field, and is
not meant to report data about their performance. Another reason for not focusing on
the claimed performance is that, given the substantial lack of standardised image sets
(like Caltech-101 [5] or similar for object recognition applications), each technique
has been tested on a different image collection, making any fair comparison impossible.

The rest of the section will be dedicated to outlining some of the most relevant papers

found in research related to computer imaging applied to skin-lesion assessment.

It is interesting to note how most of the proposed papers base their classification sys-

tems on features directly derived from the ABCD rule either in its form for normal

[23] or for dermatoscopic images [40]. While on one hand this is supposed to be a

convenient way to incorporate knowledge into the system, on the other hand it con-

strains the classification to be performed in a domain where even humans tend to have

evaluation difficulties (see chapter 3). Moreover, as LeCun pointed out in one of his

talks at UCLA during the 2005 IPAM Graduate Summer School [1], fields like natural

language processing have benefited from a learning phase not conditioned by human-

crafted added knowledge, suggesting that the same approach might prove beneficial

even in computer vision. Despite this observation, papers directly related to the ABCD

rule are still the vast majority and will be presented first.

Messadi et al. [36] derived 6 features evaluating the ABCD criteria and used them to
perform classification with an artificial neural network. In their paper the
segmentation of the lesion is automatic, and edge detection is performed on the basis
of the projection of the image in the space spanned by its first principal component,
obtained through Principal Component Analysis (PCA). She et al. [50] combined 6
features derived from the ABCD rule with 2 describing the texture of the skin pattern,
obtaining an 8-dimensional vector. This descriptor is then projected into a
2-dimensional space through PCA, reducing the dimensionality of the problem.
Classification is performed with a linear classifier in the resulting 2D space. The
authors claim successful results, but give no account of how they addressed the risk of
generating non-linearly separable classes when performing dimensionality reduction
through PCA.
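
The pipeline described for She et al. [50], an 8-dimensional descriptor projected to 2D through PCA and then separated by a linear classifier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, PCA is computed by power iteration with deflation, and the linear classifier is a simple nearest-class-mean rule. All function names are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix of the rows."""
    mu = mean_vector(rows)
    d, n = len(mu), len(rows)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        centred = [x - m for x, m in zip(row, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centred[i] * centred[j] / (n - 1)
    return cov


def top_components(cov, k, iters=500, seed=0):
    """Leading k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    components = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
        )
        components.append(v)
        # Deflate: remove the component just found from the matrix.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components


def project(rows, mu, components):
    """Coordinates of the centred rows along each principal component."""
    return [
        [sum((x - m) * c for x, m, c in zip(row, mu, comp)) for comp in components]
        for row in rows
    ]


def fit_nearest_mean(points, labels):
    """Linear classifier whose boundary bisects the two class means."""
    m0 = mean_vector([p for p, y in zip(points, labels) if y == 0])
    m1 = mean_vector([p for p, y in zip(points, labels) if y == 1])
    w = [b - a for a, b in zip(m0, m1)]
    bias = -sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, bias


def predict(w, bias, point):
    return 1 if sum(wi * xi for wi, xi in zip(w, point)) + bias > 0 else 0


# Synthetic 8-dimensional descriptors for two well-separated classes.
rng = random.Random(42)
features = [[rng.gauss(0.0, 0.5) for _ in range(8)] for _ in range(20)]
features += [[rng.gauss(5.0, 0.5) for _ in range(8)] for _ in range(20)]
labels = [0] * 20 + [1] * 20

mu = mean_vector(features)
components = top_components(covariance(features), k=2)
points_2d = project(features, mu, components)
w, bias = fit_nearest_mean(points_2d, labels)
accuracy = sum(predict(w, bias, p) == y for p, y in zip(points_2d, labels)) / len(labels)
print(accuracy)
```

The synthetic classes here are separated along the direction of largest variance, so the 2D projection keeps them apart. PCA ignores class labels, however, so nothing guarantees this in general, which is the risk noted in the paragraph above.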

Celebi et al. [17] proposed an approach where multiple features linked to the ABCD rule
are computed along with some additional texture descriptors based on the grey level
co-occurrence matrix (GLCM) [20]. All the features are pooled together and the most
relevant ones are selected automatically using several different filtering techniques.
The selected features are then used to perform classification with a Support Vector
Machine (SVM). Remaining in the domain of the ABCD rule, there are also systems which
perform classification on the basis of a subset of the 4 original lesion properties.
Clawson et al. [21] proposed a recognition system based only on border irregularity as
described by a harmonic wavelet transform. Seven descriptors are obtained at different
resolution levels in order to capture the various aspects of border irregularity. These
features are then used both in a system for modelling experts’ perception of
irregularity (using irregularity evaluations provided by dermatologists) and for
melanoma/benign lesion classification through different approaches such as Boosting and
Artificial Neural Networks. Again focusing on a subset of the ABCD properties, more
methodological papers have been presented which deal simply with finding good
descriptive features, without any specific implementation of the classification stage.
As an example, Li et al. [31] measured asymmetry (only for shape; colour distribution
was not considered) and border irregularity with a multi-scaled local fractal
algorithm, claiming an advantage over benchmark features in terms of discrimination
power. More generally, Celebi et al. [16] reviewed a variety of approaches for
automatic border detection, a fundamental step for all the previously mentioned
techniques, which rely heavily upon a good segmentation of the lesion in order to
compute the descriptors then used for classification. Border detection, in fact,
underlies any measurement used to assess asymmetry and border irregularity, and plays
an important role in determining where the degree of colour uniformity must be
estimated.
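
To make the grey level co-occurrence matrix mentioned above more concrete, the following sketch computes a GLCM for one pixel offset and derives two common texture descriptors from it. It is a minimal illustration and not the feature set of Celebi et al. [17]: the tiny 4-level image, the horizontal offset and the choice of contrast and energy as descriptors are assumptions made for the example.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalised (non-symmetric) co-occurrence matrix for offset (dx, dy).

    Entry [i][j] is the fraction of pixel pairs in which a pixel with grey
    level i has a neighbour with grey level j at the given offset.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Large when neighbouring pixels differ strongly in grey level."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Large when few grey-level pairs dominate, i.e. for uniform textures."""
    return sum(value ** 2 for row in p for value in row)


# Hypothetical image quantised to 4 grey levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, levels=4)
print(contrast(p), energy(p))
```

In practice several offsets and directions are usually combined, and the resulting descriptors are pooled with the other features before selection and classification.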

Reflecting the attention given in recent years to visual patterns for skin lesion
classification (e.g. [7]), automatic systems based on pattern analysis techniques have
emerged as a useful alternative which avoids relying upon the ABCD rule. As an example,
Serrano and Acha [48] proposed a method of classification based on a formulation of
Markov Random Fields extended to model the interdependence of the different colour
planes. Another example is the system presented by Tabatabaie et al. [52], which uses
Independent Component Analysis (ICA) and colour features for melanoma detection with a
Support Vector Machine. While the use of ICA is an interesting peculiarity of this
paper, a few implementation choices are open to criticism. The first is that ICA is
performed only on the region of the lesion, requiring a segmentation algorithm capable
of separating it from the normal skin. This process, even when performed correctly,
might result in the loss of some important characteristics of the lesions if they are
located along the border. Secondly, two sets of filters are learnt and used as coding
dictionaries to decide whether a test image is better represented by the filter bank
learnt on melanoma images or, conversely, by the one derived from the benign lesions.
While this strategy is probably easier to deal with in the classification phase, this
training method might result in 2 sets of very similar filters with just a few of them
having very high discriminative power. Moreover, considering the computational cost of
performing ICA, the derivation of a single set of filters from a mixed collection of
images might be more appropriate. A work by Mendoza et al. [35] presents a series of
scale-invariant pattern descriptors which alleviate the problems generally caused by
differences in the size of the lesions. The 24 features consist of measurements
performed on the different regions of a black-and-white mask obtained from the original
image in a way that preserves the important characteristics of the lesion pattern.
Classification is performed in the feature space following the nearest-neighbour
methodology. Finally, Capdehourat et al. [15] used Adaptive Boosting applied to
decision trees to classify melanocytic skin lesions as benign or melanoma. Lesions are
preprocessed to remove hairs, automatically segmented and subdivided into three
different regions (interior, outer border and inner border) where properties are
evaluated. A total of 57 features focusing on colour and texture are extracted and used
for classification. The authors point out that texture analysis is done using Gabor
filters without investigating the presence of particular geometric structures relevant
to the 7-point check-list [7], due to the inherent difficulties presented by the task.
This limitation, found in many other systems, puts serious constraints on how close the
machine performance can get to that of humans, especially in cases where a correct
classification can only be achieved by evaluating complex properties of the lesion.

All the classifiers presented up to now are aimed at detecting melanoma, and little work has been produced on other kinds of skin lesions. One example is the system proposed by Chaudhry et al. [18], which classifies BCCs and SCCs using features based on wavelets.

In order to expand this area, a considerable amount of research has been done at the University of Edinburgh (e.g. McDonagh et al. [34], Ballerini et al. [8]) to produce better classifiers for non-melanoma skin lesions.

Chapter 3

User Performance with ABCD rule

The ABCD rule proposed by Friedman et al. [23] and supported by the American Academy of Dermatology relies strongly on the assumption that people can effectively describe what they see in terms that, albeit qualitative, show consistency across different observers. This is true for all the qualitative rules currently in use in dermatology, either to support the diagnostic process or as self-screening guidelines.

During experiments performed last year and reported in Laskaris’s MSc Thesis [27], evidence emerged suggesting a substantial variability in the assessment different people give when evaluating characteristics of the same skin-lesion image. Specifically, it was observed that when asked to assess five properties of skin-lesion images (namely colour, colour uniformity, asymmetry, border regularity and roughness of texture) people gave very different evaluations for the same picture.

The first task of this dissertation was to investigate more rigorously the consistency of the qualitative judgement people provide when presented with a skin-lesion image, in order to understand whether the variability observed was caused by the specific experimental set-up or rather by a real difference in the way each person interprets the concepts given by the guidelines.

Since in last year’s experiment people were asked to rate the qualities by moving a slider on a scale having only linguistic expressions as references (e.g. light/medium/dark for colour), separating the effect of a subjective interpretation of the extrema from that of the intrinsic variability in the assessment was impossible. In order to isolate the two, a new experiment has been performed, modifying the interface with the addition of visual anchor points to the linguistic references. A new image-set has been provided for the experiment by the Department of Dermatology of the University of Edinburgh. While nearly half of the images were the same as those used in Laskaris’s research


[27], new ones have been introduced in order to get a more balanced representation of the main diagnostic classes. Thanks to the better coverage of the types of lesions, additional studies on the distribution of the answers could be performed.

As three of the tested characteristics (asymmetry, border regularity and colour uniformity) are the first three elements of the ABCD rule introduced in section 2.1.2, this experiment constitutes an empirical study of how useful such a guideline can prove in self-screening for melanomas.

In this chapter the interface used to collect the data and the experimental set-up will be presented first, then the obtained results will be discussed.

3.1 Experimental Set-up

In order to guarantee comparability with the answers previously obtained [27], the web interface used for collecting data has been kept substantially unchanged from the one used last year. The page (see Figure 3.1) is structured to present 45 skin-lesion images to the user in a randomly selected order, requiring the assessment of the 5 previously mentioned properties: colour, colour uniformity, asymmetry, border regularity and roughness of texture. The evaluation is provided through a set of analogue sliders (one for each property) which can be moved left-to-right, producing an associated score in the continuous 0–10 range. The random ordering was introduced to minimise the evaluation bias known to be an issue in perception-based experiments on sequences of samples [12]. The bias is mainly due to the fact that people tend to adjust their evaluations on the basis of what they have previously seen, often taking into account the evaluation given to previous samples. The effect is greater in cases where people are presented with subjects they are not familiar with, as the aid provided by prior knowledge is limited. While the randomisation cannot eliminate the bias for each single observer, the effect on the final dataset, if any, should be substantially smoothed out as each user is presented with a different sequence of images. The presence of the visual anchors should also contribute to a reduction of this bias, as the user has static references to compare the images against.

Technically, the web interface consists of a set of PHP pages and JavaScript scripts which record the answers of the user and store them in a PostgreSQL database. Minor changes were necessary on the web pages, while the tables of the PostgreSQL database were recreated by reverse engineering the PHP code, since the original structure was lost during last year’s server clean-up. A diagram of the main tables


Figure 3.1: Web interface used for collecting data during the first experiment.


Figure 3.2: Diagram of main database tables for experiment 1.

is shown in Figure 3.2. Although a better structure could have been obtained by limiting the redundancy of the stored information, the tables were not modified, in order to allow an easier comparison of the results with those obtained last year. Considering the relatively small amount of data stored in the database, the advantage of an optimised table structure would have been very limited anyway.

As previously mentioned, the interface (see Figure 3.1) differs from the one used last year only in the introduction of the visual references, which can be seen at each end of the sliders and in the middle. The design of the visual anchor points was quite important for guaranteeing accurate experiments and, before being included in the web interface, they were validated by the Department of Dermatology of the University of Edinburgh.

The first design choice was that of using cartoon-like graphics, instead of real lesion images, in order to help the user focus on the properties under evaluation one at a time. The main risk of using real images would have been that of having the user evaluation affected by properties not under scrutiny but suggesting high similarity between the sample and one visual anchor. As reported in many studies of similarity perception such as [38, 46], colour is often one of the most influential properties when evaluating the likeness of different images. If real images were used as anchor points, people might have been misled into moving one slider, say the one for asymmetry, towards one of the references only because of a resemblance in colour between the image under assessment and the visual anchor. The stylised grey-scale endpoint images are less prone to this undesirable effect, and careful attention has been paid to selecting images carrying as little information as possible about the properties not directly linked to the relevant slider. The only exceptions are the anchors for texture, where patches of real images needed to be used, as creating artificial samples satisfactorily representative of real lesions would have been impossible.

Whenever possible the visual anchors were obtained through graphical elaboration of


Figure 3.3: Example of original images (left) and obtained anchors (right) for asymmetry

(top) and border irregularity (bottom). Original images obtained from [4].

the examples provided by the American Academy of Dermatology on their web-page illustrating the ABCD rule [4]. This was the case for all the references given for colour uniformity and for the upper extrema for both asymmetry and border regularity. All the other visual references have been obtained from real images of the DERMOFIT database [2] after discussion with the team of dermatologists. Some examples showing the original images and the obtained anchors are reported in Figure 3.3.

While volunteers were actively recruited in order to have enough answers to make a comparison with data collected last year [27], the web-survey will remain available online after the end of this project to allow occasional contributors to provide their answers. The hope is to obtain a substantially larger dataset, going far beyond the 42 answers currently stored, which could result in a deeper understanding of the way laypeople assess skin-lesion images with qualitative guidelines like the ABCD rule.

The survey has been proposed to three different categories of people, ranked by their level of education in skin-lesion assessment. The largest class consisted of people with no medical training, and for them the experiment could be considered a simulation of a self-screening procedure conducted on a variety of skin lesions. The results from this group were expected to shed more light on how precisely people evaluate the key properties on which self-screening guidelines, in particular the ABCD rule, rely. The second group was that of dermatologists. Given the substantial prior knowledge dermatologists have of skin lesions, this was considered a control group to verify whether variability in the evaluation is mainly due to the lack of prior knowledge of laypeople or rather to differences in personal perception and assessment. Finally, a group of people with some dermatology-related knowledge (medical doctors with different specialisations, nurses, medicine/nursing students, etc.) was included as an intermediate level between the two extremes. At the time of writing, answers have been obtained from 33 laypeople, 4 dermatologists and 5 non-dermatologist medically-trained people.


3.2 Results

The data analysis phase is divided into two stages, each focusing on some specific aspects. In the first stage the effect of the inclusion of visual anchors is tested by comparing the results of the experiment with those obtained using only the linguistic references. The second stage, on the other hand, is related to the distribution of the collected data, with particular attention given to the correlation observed between the answers obtained for the different properties.

In order to correctly understand the numerical values presented in the following analysis it is useful to remember that the endpoints for each property are

colour                  0 = light        10 = dark
colour uniformity       0 = uniform      10 = not uniform
asymmetry               0 = symmetric    10 = asymmetric
border regularity       0 = regular      10 = irregular
roughness of texture    0 = smooth       10 = rough

3.2.1 Impact of Visual Anchors

The first and probably most important part of the data analysis was dedicated to evaluating how the inclusion of visual anchors affected the answers given by the volunteers who took part in the experiment. The reason behind this experiment was, in fact, to understand whether the extremely high variance in the scoring reported by Laskaris [27] was due to the lack of standardisation of the extrema of the scoring scale or rather to a more intrinsic variability in the answers linked to the subjectivity of the evaluation process.

Only the 20 images shown both in this experiment and in previous ones were considered in this part. For each of them, statistical testing has been performed to verify whether the inclusion of the visual references resulted in a statistically significant change in the variance of the answers. It is important to underline that only the variance of the measurements is comparable between the two sets of data, while any observed change in mean does not carry any useful information. This is due to the fact that, regardless of the careful selection procedure, any choice of visual anchors is somewhat arbitrary, especially for the central reference. It is reasonable to assume, hence, that a different set of visual references would result in a shift in mean, while the spread around this mean should remain substantially stable. Notable exceptions are those skin-lesion


[Figure 3.4 plot: scores (0–10) given for lesion D414b on each of the five properties (colour, col. uniformity, asymmetry, border reg., roughness), shown without and with visual anchors.]

Figure 3.4: Impact of visual anchors on answers for lesion D414b. Marks encoding level

of skin-related knowledge: ’-’ laypeople, ’+’ medically-trained people, ’◦’ dermatologists

images whose average score for some of the properties is very near to the extrema of the scale. Since the score scale is bounded to the 0–10 range, in fact, the closer the score gets to the maximum (or minimum), the lower the variance tends to become as an effect of the upper (or lower) bound.
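The effect of the bounded scale on the variance can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Gaussian rater model, the spread `sd = 2.0`, the latent means and the helper name `clipped_scores` are all hypothetical and are not taken from the experimental data.

```python
import random
import statistics


def clipped_scores(mean, sd, n, rng, low=0.0, high=10.0):
    """Draw n Gaussian scores and clip them to the [low, high] slider range."""
    return [min(high, max(low, rng.gauss(mean, sd))) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sd = 2.0                # hypothetical rater spread before clipping

# Same latent spread, but the latent mean moves towards the lower bound.
var_by_mean = {}
for latent_mean in (5.0, 2.0, 0.5):
    scores = clipped_scores(latent_mean, sd, 10_000, rng)
    var_by_mean[latent_mean] = statistics.pvariance(scores)
    print(f"latent mean {latent_mean:3.1f} -> observed variance {var_by_mean[latent_mean]:.2f}")
```

Although the latent spread is identical in the three cases, the observed variance decreases as the latent mean approaches 0, which is the artificial shrinkage described above.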

In the first place each of the 20 images was considered separately and the significance of the observed change in variance was tested. Given the specific characteristics of the scoring system (e.g. the limited 0–10 scoring range) and of the collected answers (such as small sample size, heavy tails and skewness of their distribution) the data could not be considered to be distributed according to a Gaussian. Bartlett’s test [9] was hence inappropriate for the analysis due to its assumption of Normality of the data. As an alternative, the Brown–Forsythe test [14] was used. The Brown–Forsythe test is a variation of Levene’s test [30] in which the median is used in place of the mean of the sample. This difference makes the test more robust in cases where the data under analysis show a highly skewed distribution. Given the aforementioned bounded score scale, skewed distributions are often observed and hence this test


Property             Significant Changes   Reduction   Increase
Colour               -                     -           -
Colour Uniformity    5/20                  3/20        2/20
Asymmetry            2/20                  2/20        -
Border Regularity    10/20                 10/20       -
Texture Roughness    6/20                  2/20        4/20

Table 3.1: Summary of the results of the Brown–Forsythe tests (at 95%) on changes in

the variance after including the visual anchor points. Each column reports the number

of significant changes for each feature over the 20 test images.

is more appropriate. Specifically, the computed test statistic is

\[
F = \frac{N-k}{k-1} \cdot \frac{\sum_{i=1}^{k} n_i \, (z_{i\cdot} - z_{\cdot\cdot})^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (z_{ij} - z_{i\cdot})^2} \sim F_{k-1,\,N-k}
\]

where

\[
z_{ij} = |y_{ij} - y_{i\cdot}|, \qquad
z_{\cdot\cdot} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} z_{ij}, \qquad
z_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} z_{ij}
\]

and $N$ is the total number of samples, $k$ is the number of groups, $n_i$ is the number of samples in group $i$, $y_{ij}$ is the value of the $j$-th sample of the $i$-th group (in our case $i \in \{1, 2\}$, representing the answers before and after the introduction of the visual anchors), $y_{i\cdot}$ is the median of the $i$-th group, $z_{\cdot\cdot}$ is the mean of all $z_{ij}$, $z_{i\cdot}$ is the mean of $z_{ij}$ for elements of group $i$ and $F_{k-1,\,N-k}$ represents the F distribution with $k-1$ and $N-k$ degrees of freedom.
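The statistic defined above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the code used for the analysis: the function name `brown_forsythe` and the two score lists are hypothetical, and the p-value (which requires the F distribution) is left out.

```python
from statistics import mean, median


def brown_forsythe(*groups):
    """Return (F, df1, df2) for the Brown-Forsythe test of equal variances.

    Each group is a sequence of scores; deviations are taken from the
    group median, as in the definition of z_ij in the text.
    Assumes at least two groups and a non-zero within-group sum of squares.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # z_ij = |y_ij - median of group i|
    z = []
    for g in groups:
        centre = median(g)
        z.append([abs(y - centre) for y in g])

    z_group = [mean(zi) for zi in z]                 # z_i.
    z_all = sum(sum(zi) for zi in z) / n_total       # z_..

    between = sum(len(zi) * (zbar - z_all) ** 2 for zi, zbar in zip(z, z_group))
    within = sum((v - zbar) ** 2 for zi, zbar in zip(z, z_group) for v in zi)

    f_stat = (n_total - k) / (k - 1) * between / within
    return f_stat, k - 1, n_total - k


# Hypothetical scores for one image, before and after adding visual anchors.
before = [1.0, 9.0, 2.0, 8.0, 5.0, 0.5, 9.5, 3.0]
after = [4.0, 5.0, 6.0, 5.5, 4.5, 5.0, 6.5, 3.5]
f_stat, df1, df2 = brown_forsythe(before, after)
print(f"F = {f_stat:.3f} on ({df1}, {df2}) degrees of freedom")
```

The returned value would then be compared with the upper tail of the F distribution with the reported degrees of freedom. In SciPy, `scipy.stats.levene` with `center='median'` performs the same test and also returns the p-value.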

The tests were run at a 95% confidence level and the results are presented in table 3.1, where the number of significant changes in variance is reported along with their direction (increase/reduction). While it is clear that the introduction of the visual anchors had no effect whatsoever on the variance of the answers obtained for the colour of the lesion, the other results cannot be interpreted without further analysis. The reason is that, since the gathered scores belong to the real interval 0–10, they should be modelled as censored distributions, with censoring taking place on both the lower and the upper side. What happens in practice is that as the mean approaches one of the extreme values (consider the lower bound 0 as an example), the data progressively show less variability, since no value lower than 0 is allowed. This shrinkage of the distribution is artificially induced by the bounded scale, and for this reason all the variances obtained for values near the extremes are to be considered unreliable. While the statistical tests used cannot cope effectively with it, this situation should not be overlooked, as an observed reduction in the variance might actually be the effect of a shift in the mean of the distribution towards one of the extreme values. As it turns out, this is often the case. Table 3.2 is a more detailed version of table 3.1. For each statistically significant change detected by the Brown–Forsythe tests, the values of the variance and of the median before and after the inclusion of the visual anchors are reported. As can be seen, most of the cases of statistically significant changes in the variance are actually associated with a shift of the median of the distribution (considered instead of the mean given the small sample size) towards one of the extremes of the 0–10 range. Finding a fixed value of the median above or below which the results of the test can be considered reliable is not easy, but if we assume the interval 2–8 to be a safe guess (leaving 20% of the possible 0–10 values on either side) we see that only 5 of the cases reported in table 3.2 have the median in this interval both before and after the inclusion of the visual anchors: one increment in variance for colour uniformity (lesion P206c), two reductions in variance for border regularity (lesions D414b and D578a) and two increments for roughness of texture (lesions P206b and P337a). Two border-line cases are represented by the significant increment in variance for the roughness of texture of lesions D414b and D578a despite the shift of their medians towards the upper bound.
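The 2–8 screening described above can be reproduced with a short sketch. The medians are copied from the rows reported in table 3.2; the variable and function names are hypothetical, and the interval bounds are the "safe guess" assumed in the text rather than a statistically derived threshold.

```python
# (property, lesion, median_before, median_after), values as listed in table 3.2
TABLE_3_2_MEDIANS = [
    ("Colour Uniformity", "D262b", 2.407, 0.500),
    ("Colour Uniformity", "D726", 1.593, 0.667),
    ("Colour Uniformity", "P206c", 6.444, 5.426),
    ("Colour Uniformity", "P337a", 1.926, 0.538),
    ("Colour Uniformity", "P446", 8.074, 7.500),
    ("Asymmetry", "P337c", 1.704, 0.370),
    ("Border Regularity", "D270", 2.815, 1.315),
    ("Border Regularity", "D414b", 6.000, 2.982),
    ("Border Regularity", "D578a", 6.593, 4.519),
    ("Border Regularity", "D726", 2.037, 1.019),
    ("Border Regularity", "P257", 2.556, 1.241),
    ("Border Regularity", "P306a", 2.741, 1.352),
    ("Border Regularity", "P337a", 1.037, 0.204),
    ("Border Regularity", "P337c", 1.370, 0.241),
    ("Border Regularity", "P337e", 4.000, 0.889),
    ("Texture Roughness", "D414b", 7.963, 8.222),
    ("Texture Roughness", "D578a", 8.296, 8.482),
    ("Texture Roughness", "D726", 2.630, 1.037),
    ("Texture Roughness", "P206b", 6.926, 6.834),
    ("Texture Roughness", "P337a", 7.667, 7.074),
]


def medians_in_band(median_before, median_after, low=2.0, high=8.0):
    """True when both medians lie inside the assumed safe interval [low, high]."""
    return low <= median_before <= high and low <= median_after <= high


reliable = [
    (prop, lesion)
    for prop, lesion, before, after in TABLE_3_2_MEDIANS
    if medians_in_band(before, after)
]
for prop, lesion in reliable:
    print(f"{prop}: {lesion}")
```

Changing `low` and `high` shows how sensitive the list of cases regarded as reliable is to the chosen interval.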

On the basis of these considerations, it is important to interpret the data presented in table 3.3 with extreme care. These data were obtained by testing the statistical significance of the changes in the average variance for each of the five properties with the Mann–Whitney test. The Mann–Whitney test [22] is an extension of the Wilcoxon test [22] to cases where the samples have different sizes. In turn, the Wilcoxon test is a non-parametric test often used in place of Welch’s t-test [54] when the assumption of Gaussianity does not hold for the samples.
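As an illustration of how a one-sided Mann–Whitney comparison works, the following sketch computes the U statistic by direct pair counting and an approximate p-value. It makes several assumptions that are not taken from the dissertation: the function names and the two lists of per-image variances are hypothetical, and the p-value uses the normal approximation without tie correction, which is only a rough guide for small samples.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for x: number of pairs with x_i > y_j, ties counting 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def one_sided_p_greater(x, y):
    """Approximate p-value for the alternative 'x tends to be larger than y'.

    Uses the normal approximation of U (no tie correction).
    """
    n, m = len(x), len(y)
    u = mann_whitney_u(x, y)
    mu = n * m / 2
    sigma = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal


# Hypothetical per-image variances before and after adding the visual anchors.
var_before = [5.1, 6.3, 4.8, 7.0, 5.9]
var_after = [3.2, 4.1, 2.9, 5.0, 3.8]
u = mann_whitney_u(var_before, var_after)
p = one_sided_p_greater(var_before, var_after)
print(f"U = {u}, approximate one-sided p = {p:.4f}")
```

Because the test only uses the ordering of the values, it does not require the variances to follow a Gaussian distribution.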

As pointed out before, there is no statistical means of deciding which of the cases should be considered and which should be ignored because their median is too close to one of the extreme values of the scoring scale. For this reason all the data have been included in the test, but the results must be considered with the caveat that the reliability of reductions or increments cannot be guaranteed and each specific case must be


Lesion reference     σ² before   σ² after   Median before   Median after

Colour Uniformity
D262b                2.840       0.980      2.407           0.500
D726                 3.266       1.00       1.593           0.667
P206c                1.985       5.709      6.444           5.426
P337a                6.027       0.525      1.926           0.538
P446                 2.365       5.277      8.074           7.500

Asymmetry
P337c                2.344       0.807      1.704           0.370

Border Regularity
D270                 6.620       1.723      2.815           1.315
D414b                8.849       3.746      6.000           2.982
D578a                9.402       4.667      6.593           4.519
D726                 2.984       1.890      2.037           1.019
P257                 7.762       2.928      2.556           1.241
P306a                8.344       2.064      2.741           1.352
P337a                1.638       0.180      1.037           0.204
P337c                1.714       0.450      1.370           0.241
P337e                5.781       2.599      4.000           0.889

Texture Roughness
D414b                2.123       4.443      7.963           8.222
D578a                1.904       5.041      8.296           8.482
D726                 3.972       1.718      2.630           1.037
P206b                3.737       6.558      6.926           6.834
P337a                3.798       7.601      7.667           7.074

Table 3.2: Statistics of the scores obtained before and after the introduction of the visual

anchors for the statistically significant changes detected by the Brown–Forsythe tests.


Property             σ² before   σ² after   Ha                       p-value
Colour               2.230       2.386      σ² before < σ² after     0.3086
Colour Uniformity    3.892       4.190      σ² before < σ² after     0.2317
Asymmetry            5.030       5.073      σ² before < σ² after     0.3507
Border Regularity    5.600       3.417      σ² before > σ² after     0.0004
Texture Roughness    4.955       5.840      σ² before < σ² after     0.0542

Table 3.3: Results of the Mann–Whitney (one-sided) tests on the change of the aver-

age variance after the inclusion of the visual anchors. The alternative hypothesis is

presented in column Ha.

separately evaluated. In particular, the extremely high significance of the reduction of the average variance for border regularity could arguably be considered a wrong estimate, as table 3.2 shows clearly that most of the cases of reduction in the variance of this property are obtained when the median is near either 0 or 10. The other significant change (nearly at 95%) is the increase in the average variance of texture roughness, observed despite the fact that 4 out of the 6 lesions for which the change is significant had the median of the recorded score moved towards one of the extremes. Considering this, it can probably be concluded that this result is reliable. The other three properties (colour, colour uniformity and asymmetry) do not show any significant change and hence there is no reason to question the soundness of the associated tests.

Overall, two conclusions can be drawn. First of all, the inclusion of the visual anchors did not have any considerable impact on the variability of the answers. The only statistically significant result seems to be an increase in the variance measured for the evaluation of the roughness of texture, while the reliability of the figures obtained for the border regularity is debatable.

These results are quite important as they indicate that the variability observed in the evaluation of skin lesions obtained following qualitative guidelines like the ABCD rule is not mainly due to a subjective interpretation of the concepts on which the guideline is based (e.g. regular/irregular, symmetric/asymmetric, ...), since the inclusion of visual cues did not reduce the observed variance. If the variability can really be ascribed to the intrinsic subjectivity of the assessment, as the experimental results seem to suggest, the usefulness of guidelines like the ABCD rule is under serious question.

Secondly, even after the inclusion of the visual anchors, the value of the variance is quite high. The obtained standard deviations, in fact, range from a minimum of 1.545 for colour to a maximum of 2.417 for roughness of texture, which are quite high considering that the scoring values range between 0 and 10. This substantial lack of agreement about the evaluated concepts makes learning useful rules from the gathered data virtually impossible for any machine learning technique.
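As a quick check, and assuming that the quoted standard deviations are the square roots of the smallest and largest average variances reported after the inclusion of the anchors in table 3.3 (2.386 for colour and 5.840 for texture roughness), the figures are consistent:

\[
\sqrt{2.386} \approx 1.545, \qquad \sqrt{5.840} \approx 2.417
\]

On a 0–10 scale, a standard deviation of about 2.4 means that scores roughly 2.4 points above or below the mean are entirely ordinary, which gives a sense of the disagreement between observers.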

The analysis of the difference between the answers given by people with different levels of skin-related knowledge could not be performed as precisely as hypothesised in the first place. This was due to the fact that 4 dermatologists and 5 medically-trained people are insufficient to get any meaningful estimate of the variance within these groups. It can be reported, though, that the range in which the answers of these two groups are observed is comparable to that of laypeople, suggesting that the subjectivity of the evaluation predominates over other elements such as prior knowledge derived from education or experience. It is fair to say, though, that this observation is based on a limited amount of data and hence cannot be considered conclusive without further study.

3.2.2 Correlation of Different Properties

Other kinds of analysis that can give an interesting insight into the data are based on the study of the distribution of the answers. In particular, the analysis of correlation between the five properties evaluated by the survey participants can give a better understanding of the collected data from two different points of view. On one hand, particular correlation structures can suggest relations between the elements of the ABCD rule which might affect the evaluation process of the observer. On the other hand, understanding the correlation between different elements is interesting from the point of view of the development of automatic systems, as the decision on the balance between redundant and independent features is a key part in the design of any classifier.

The correlation analysis has been performed on all the 45 images shown, not only on the 20 in common with previous experiments, and at different levels. At first all the data were considered together to get a general idea of any possible correlation pattern, then the clinical classes were analysed one at a time to verify whether the correlation was stronger in some classes than in others.

Figure 3.5 shows the scatter-plots obtained from the five lesion properties over 1,890 observations. Each of the five properties is represented by a row and a column of the matrix and, for each pair, both the scatter-plot and the correlation coefficients are reported. Along the main diagonal a histogram showing the distribution of each single property is included.


In order to better capture the correlation of the data two different coefficients were considered. Along with the classical Pearson’s correlation coefficient [45]

\[
\rho_P = \frac{\mathrm{cov}(x, y)}{\sigma_x \cdot \sigma_y}
\]

Spearman’s rank correlation coefficient $\rho_S$ was computed [22]. Spearman’s $\rho_S$ is a non-parametric measurement of the monotonicity of the correlation between two variables. In particular, it is based on the ranking of the observed samples rather than on their raw values. In other words, considering $n$ observations characterised by two variables each, $x$ and $y$, observations are ranked according to their value so that, say for variable $x$, $\mathrm{rank}(i) < \mathrm{rank}(j)$ if $x_i > x_j$, where $i$ and $j$ represent two different observations. Once the ranking according to the two variables has been obtained, a ranking difference $d_i = r^x_i - r^y_i$ is computed, where $r^x_i$ and $r^y_i$ represent the ranking position of observation $i$ with respect to variables $x$ and $y$. The correlation coefficient $\rho_S$ is then computed as

\[
\rho_S = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
\]
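The two coefficients defined above can be sketched in plain Python as follows. The function names and the toy data are hypothetical; the Spearman implementation follows the rank-difference formula directly and therefore assumes that there are no tied values, which would not hold in general for real survey scores.

```python
import math


def pearson(x, y):
    """Pearson's coefficient: covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # the common 1/n factors cancel out


def ranks_descending(values):
    """Rank 1 is given to the largest value, as in the text (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(x, y):
    """Spearman's coefficient from the rank differences d_i (no ties assumed)."""
    n = len(x)
    rx, ry = ranks_descending(x), ranks_descending(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(f"Pearson  = {pearson(x, y):.3f}")
print(f"Spearman = {spearman(x, y):.3f}")
```

On monotonic but non-linear data such as this, the rank-based coefficient reaches its maximum while Pearson’s coefficient stays below 1, which is the reason for reporting both.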

Spearman’s coefficient is useful to complement the information obtained with Pearson’s coefficient. In particular, since Spearman’s coefficient is computed on the rank of the observations, it is better suited to capture correlation when the monotonic relation between the variables is non-linear. Moreover, it is more robust to the presence of outliers and hence produces more stable results. Both from the coefficients and by visual inspection, it appears clear that while colour and roughness of texture do not show any significant correlation with the other variables, colour uniformity, asymmetry and border irregularity are positively correlated. The fact that colour and roughness of texture appear uncorrelated with the other properties allows us to exclude that the observed correlations are only due to some bias induced by the way users answer, such as a preference for high or low scores. If that were the case, in fact, all the variables should be correlated and not only those which are part of the ABCD rule.

As all the data are considered together, very different distributions are mixed and, as a result, the correlation coefficients tend to be a less precise representation of the structure of the data. Visual inspection, on the other hand, gives a better understanding of the different tendencies, which can be seen as regions of the graph with a high density of points. Besides the marked presence of points along the diagonal in those scatter-plots where correlation is significant, horizontal and vertical lines corresponding to the middle of the grading scale (around 5) can be observed in most graphs, suggesting that people used an intermediate value very often. Moreover, some clustering of the


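[Figure 3.5 plot: matrix of pairwise scatter-plots for Colour, Col. Uniformity, Asymmetry, Border and Roughness (axes 0–10), panel title "All Classes". Correlation coefficients shown in the panels:

Pair                            ρP      ρS
Colour, Col. Uniformity         0.19    0.18
Colour, Asymmetry               0.12    0.11
Col. Uniformity, Asymmetry      0.66    0.66
Colour, Border                  0.097   0.097
Col. Uniformity, Border         0.50    0.50
Asymmetry, Border               0.64    0.63
Colour, Roughness               0.26    0.25
Col. Uniformity, Roughness      0.17    0.18
Asymmetry, Roughness            0.23    0.25
Border, Roughness               0.16    0.17]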

Figure 3.5: Matrix of scatter-plots showing the correlation patterns between the different

evaluated properties. The correlation coefficients ρP (Pearson’s) and ρS (Spearman’s)

are reported. The data of all clinical classes were included.


[Figure 3.6 plots: for each clinical class, a matrix of pairwise scatter-plots for Col. Uniformity, Asymmetry and Border (axes 0–10). Correlation coefficients shown in the panels, as ρP / ρS:

Class                        Col. Uniformity, Asymmetry   Col. Uniformity, Border   Asymmetry, Border
(a) Benign Nevus             0.67 / 0.69                  0.51 / 0.54               0.63 / 0.65
(b) Melanoma                 0.74 / 0.75                  0.61 / 0.61               0.73 / 0.72
(c) Seborrhoeic Keratosis    0.56 / 0.55                  0.25 / 0.24               0.38 / 0.36
(d) Dysplastic Nevus         0.56 / 0.54                  0.47 / 0.45               0.68 / 0.65]

Figure 3.6: Results of correlation analysis for each of the four clinical classes of lesions.

data can be observed. The scores for border irregularity, as an example, show a clear

concentration in the 0–1 interval.

In order to understand whether particular situations like this are due to one specific clinical class, the analysis has been repeated at a finer level of granularity, dividing the observations according to the type of lesion they refer to. Within the single classes, as in the general case, no significant correlation is found for colour and roughness of texture, hence Figure 3.6 reports only the results of correlation analysis for colour uniformity, asymmetry and border irregularity.

Both benign nevi and melanomas show high values of correlation between the consid-

ered properties, dysplastic nevi follow with a strong correlation between asymmetry


and border irregularity, while seborrhoeic keratoses show a less sharply defined corre-

lation between the properties, with only a weak correlation between colour uniformity

and asymmetry. While in general cases the magnitude of the correlation coefficients

would not be considered particularly high, especially for the coefficients in the 0.60

range, it is interesting that values as high as 0.75 have been reached despite the ex-

tremely high variability in the answers reported in previous sections. This suggests

that even though people express very different personal judgements about the same

image, their evaluation of colour uniformity, asymmetry and border irregularity is usu-

ally highly correlated. In part this was expected as, following the guidelines of the

ABCD rule, the distribution of colour affects both colour uniformity and asymmetry,

and similarly a jagged edge would in many cases affect border irregularity as well as

asymmetry, but to the knowledge of the author no report of this has been made before

and no empirical evidence has been presented.
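The two coefficients reported in Figure 3.6 (ρP and ρS, read here as Pearson and Spearman correlation) can be computed along the following lines. This is a minimal sketch, not the analysis code used in the study: the function names and the example scores are made up for illustration, and ties are handled with average ranks, which is a common convention but an assumption here.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores on the 0-10 scale, for illustration only.
asymmetry = [1, 2, 2, 5, 7, 8, 9]
border = [0, 1, 3, 4, 8, 7, 10]
rho_p = pearson(asymmetry, border)
rho_s = spearman(asymmetry, border)
```

Reporting both coefficients is useful because the Spearman value depends only on the ordering of the scores, so it is less sensitive to how each participant uses the 0–10 scale.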

These findings have at least two important implications, one more related to the assess-

ment performance of humans, the other linked to the design of automatic classification

systems.

On one hand, asking people to evaluate correlated properties can be useful: since the properties should be consistent (e.g. it is rare to find a symmetric lesion with highly irregular borders), fewer doubtful cases should be faced, which helps in spotting the dangerous lesions. On the other hand, the information content of a set of redundant properties is reduced, and hence different evaluation criteria might capture more information about the lesions.

Considering the observed correlation, then, one problem related to the design of automatic classifiers based on the ABCD rule appears clear: if one of the criteria is not evaluated correctly, then the others are highly likely to be incorrect as well. Since a correct evaluation of colour uniformity, asymmetry and border irregularity is highly dependent on the detection of the border of the lesion, a sub-optimal segmentation could lead to a wrong evaluation of all three properties, resulting in an inaccurate classification. Considering that segmentation is a very difficult task for the reasons previously discussed, relying so heavily on it to obtain a correct classification seems quite inappropriate.

Additionally, another interesting fact emerges from the distribution of the answers given for melanomas. As shown in Figure 3.6(b), the histograms of both asymmetry and border irregularity closely resemble the probability density function of a Beta distribution with α = β = 0.5, having high probability for the extreme values and a substantially uniform distribution between them. Since the ABCD rule recommends classifying as possible melanomas those lesions showing high values for at least one of the ABCD criteria, the distribution of the answers obtained in this empirical study suggests that many of the cases would not be detected. This is especially true because, given the high correlation previously highlighted, it is unlikely to observe cases with two out of the three ABC properties ranked as “non suspicious” and the third as “potentially dangerous”. This gives another good reason to avoid developing automatic classifiers that directly implement the ABCD rule.
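To make the shape of the Beta distribution mentioned in this section concrete, its density can be written out explicitly. The block below is a standard textbook identity rather than something derived from the experimental data; mapping the 0–10 scores onto the unit interval via x = score/10 is an assumption made here only for illustration.

```latex
f(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},
\qquad
f\left(x;\tfrac12,\tfrac12\right) = \frac{1}{\pi\sqrt{x(1-x)}},
\quad 0 < x < 1 .
```

This density is U-shaped: it grows without bound as x approaches 0 or 1 and reaches its minimum, 2/π ≈ 0.64, at x = 0.5, so most of the probability mass sits near the two extremes of the scale.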

3.3 Discussion

In this chapter the results from an empirical study on the performance obtained from

people when following the ABCD rule have been presented. Several conclusions can

be drawn to answer the initial question regarding whether the ABCD rule could be

useful in the development of automatic classifiers.

The first important point emerging from the data analysis is that providing visual cues to the participants did not help reduce the variability of their answers. The most

likely explanation behind this is the subjectivity in the evaluation of the degree of

similarity given by different people when comparing the same item to standard visual

references.

While at the beginning a Fuzzy Inference System implementing the ABCD guideline

was regarded as a possibly valuable tool to be developed during the dissertation, the

results presented show that this approach is clearly infeasible. This is due to the ex-

treme variability of the obtained answers which would make it impossible to learn any

rule on how to classify, for example, a lesion as asymmetric. For the same reason,

any system trying to learn the human decision process from the gathered data would

arguably fail.

A second result, even more important from the point of view of the development of

automatic classifiers, is the evidence of a high correlation between colour uniformity,

asymmetry and border irregularity. This correlation between the different properties

would make an automatic system based on them extremely prone to errors due to the

limited amount of information used in the classification process caused by the lack of

independence between them.

In conclusion, the ABCD rule seems overall inappropriate for automatic classification

systems and its use is discouraged.


As a side-note, the results obtained in the experiment would suggest that the ABCD rule is not an appropriate guideline for humans either. Since the subject is complex, though, any conclusion on this final claim should be drawn and motivated by dermatologists, whose expertise in the field can provide better grounded opinions.

Chapter 4

Visual Similarity of Skin Lesions

As previously explained, automatic classification of skin-lesion images is a hard task

for several reasons. One of the most relevant lies in the fact that since classification is

performed according to medical criteria, lesions belonging to the same class can have

very different appearance. Apart from the “soft” characterisation of the melanoma pro-

vided by the ABCD rule [23] or the 7-point check-list [7], which present a series of

possible warning signs rather than a real visual classification, to the author’s knowl-

edge there is no available classification of skin lesions purely based on their visual

properties. While this is clearly not a problem for specialists, it makes classification

quite challenging for automatic systems. The results of the experiment presented in the

previous chapter, moreover, show how visual guidelines developed by dermatologists,

such as the ABCD rule, can give a very high variability in the results when followed

by laypeople. This finding suggests the need for guidelines reflecting more closely the

criteria spontaneously used by people with no medical training when evaluating skin

lesions. A deeper understanding of these criteria would, additionally, provide a good

starting point to find characterising features which can be used for automatic classifi-

cation. For this reason a study focused on extracting the specific aspects of perceived

similarity has been set up.

Given the short amount of time available for data collection, it was decided to focus

the study on Basal Cell Carcinomas (BCCs) for several reasons. First of all, the class

of BCCs contains lesions having very different appearance (see Figure 2.2). Even

though some properties are coherent in the whole class (for example BCCs are gener-

ally non-pigmented, i.e. “pale”, lesions) others vary quite a lot. An example of these

non-homogeneous properties is shape, as shown in Figure 4.1. Using only BCC im-

ages, hence, increases the probability of avoiding uninteresting trivial classes such as


Figure 4.1: Basal Cell Carcinomas showing substantial differences in shape (panels a–f).

pigmented vs. non-pigmented lesions which could arise in a mixed image-set contain-

ing, for example, Melanocytic Nevi and BCCs. Furthermore, given its variability of

appearance, the BCC class is harder to classify than others using automatic systems

based on image-analysis techniques. A new classification system based on the features

which show good discriminative power among the visual classes would surely be a

useful addition to the ones under continuous development at this University.

In this chapter the procedure followed to define visual classes will be presented. An

overview of the system developed for data collection will be given in section 4.1, while

the actual procedure followed to define the visual classes will be outlined in section 4.2.

4.1 Data Collection

The decision to define visual classes through the evaluation of the feedback provided

by users highlights the importance of an appropriate data collection infrastructure. In

order to avoid biases in the obtained data, the system should simply support the action

of the users without significantly affecting their behaviour.

Aldridge et al. [6] reported how people with no medical training could successfully group skin-lesion images into visually coherent classes during some experiments. Comparable approaches have been presented in visual similarity studies [38, 46] conducted on colour patterns and natural images. Since the approach proposed in [6] and [38] required a direct comparison between all the pairs of images in the image-set, it could not be directly scaled to a large image-set without an unreasonable increase in the time required of the user. Rogowitz et al. [46], on the other hand, chose a more time-effective approach: showing, on a computer, a sequence of screens each containing one target image and eight samples, and asking volunteers to match the target with the most similar sample. While keeping the required time manageable, this experimental set-up had the drawback of obtaining a very limited amount of information from each of the proposed screens. Specifically, this was due to the request to match the target with the single most similar sample. If two or more very similar samples were shown, only one could be matched with the target, losing records of similarity. More subtly, this approach could increase the effect of the subjectivity of the evaluation. Since two similar samples have roughly the same probability of being chosen, the recorded match would give more importance to the subjective preference of the user than to the actual similarity with the target. Even though this condition is rare due to the randomised generation of the screens, it is nonetheless undesirable, since it can lead to votes being evenly spread over similar sample stimuli, possibly leading to wrong interpretations in the data analysis. Moreover, no mention is made in [46] of any option given to the user to avoid providing a match if all the samples are considered dissimilar to the target.

The application developed during this dissertation for data collection was designed to

avoid the aforementioned shortcomings. The interface (see Figure 4.2) consists of a

sequence of 10 screens each presenting to the user 2 target images (top of the screen)

and 24 samples (pooled at the bottom of the screen). When generating the screens,

images are randomly selected to satisfy two conditions:

1. each image must appear only once as a target in the whole experiment

2. if one lesion is chosen as a target, it cannot appear as a sample in the same screen
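The two conditions above can be sketched as a small screen generator. This is only an illustration under stated assumptions: the function name, its parameters and the uniform random sampling are hypothetical, and the code of the actual web application is not reproduced here.

```python
import random


def generate_screens(target_pool, sample_pool, n_screens=10,
                     targets_per_screen=2, samples_per_screen=24, seed=None):
    """Build a list of (targets, samples) screens satisfying both conditions."""
    rng = random.Random(seed)
    # Condition 1: drawing all targets without replacement means no image
    # is used as a target more than once in the whole experiment.
    targets = rng.sample(list(target_pool), n_screens * targets_per_screen)
    screens = []
    for s in range(n_screens):
        first = s * targets_per_screen
        screen_targets = targets[first:first + targets_per_screen]
        # Condition 2: a target may not also be a sample on the same screen.
        candidates = [img for img in sample_pool if img not in screen_targets]
        screen_samples = rng.sample(candidates, samples_per_screen)
        screens.append((screen_targets, screen_samples))
    return screens


# Hypothetical image ids 0..29, matching the size of the initial image set.
images = list(range(30))
screens = generate_screens(images, images, seed=0)
```

Keeping separate `target_pool` and `sample_pool` arguments also covers the later, asymmetric phase of the experiment, in which targets and samples are drawn from pools of different sizes.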

The user is required to assign 0–6 samples to each target simply by moving the corresponding image into one of the slots available in the central part of the screen. Allowing the user to match multiple samples to the same target alleviates one of the limitations

highlighted for [46], since whenever similar samples are present in a screen, they can

all be matched to the same target. This is expected to reduce the impact of the subjec-

tive evaluation of the resemblance on the final result, supporting the emergence of the


Figure 4.2: Web interface developed to collect similarity assessment data.


Figure 4.3: Diagram of database tables for experiment 2.

underlying similarity of the images. Since the user is not forced to assign at least one

sample to each target, moreover, another possible bias of [46] is eliminated. Finally,

limiting the maximum number of matches to 6 should avoid selections made according

to loose similarity criteria.

The choice of showing 2 targets for each screen has the double function of maximising

the data gathered and of forcing the user to make a decision between two possibilities,

which should cause a more careful evaluation of similarities and differences observed

between the images.

One problem of this set-up is that, due to the completely random selection procedure,

the targets in a screen could be very similar to one another, making it difficult for the

user to decide how to match the samples. The effects of this drawback, though, have

been considered less important than the benefit gained by the presence of multiple tar-

gets.

In order to reach as many people as possible, the interface was implemented as a web-

application running on a Tomcat server and is available, as of August 2010, at the

address http://demos.inf.ed.ac.uk:8848/webs2. The application was designed to guar-

antee easy user interaction and to facilitate image evaluation. Some of the implemen-

tation choices made in this direction include presenting the lesion images on a black

background, which enhances colour perception, and providing easy zooming for the

small sample images.

Along with the matches, the structure of each screen and some demographic data of

the users are stored on a PostgreSQL server. A diagram of the database tables can be

found in Figure 4.3.

In total, a set of 115 BCC images has been used in the experiment. Given the short time available for the data collection phase, a small number of participants was expected for the experiment. In this scenario, including all the images from the start would have resulted in a similarity matrix providing only a rough approximation of the similarity structures. This is due to the fact that, as discussed later in section 4.2.1, the similarity between two images is modelled by the frequency with which they have been matched together. By a parallel with a frequentist approach to probability, we know that this estimate becomes reliable only as the number of observations grows, and in small samples it can present substantial errors. For this reason, images need to appear a fair number of times in the same screen before the computed index of similarity can be considered reliable. In order to ensure this, in the first phase of the experiments only 30 out of the 115 images were actually used. Once a reasonable number of answers had been collected, the image set was enlarged, but in an asymmetric way. Ten more images were added to the pool from which targets are selected, while all the 115 images were considered as possible samples. This approach is intended to widen the coverage of the study by making the similarity matrix less sparse, while still ensuring as much reliability in the estimates as possible. At the time of writing 43 people have contributed their answers to the survey, for a total of 2,395 sample-target matches. While the number may seem large, it would have been far from sufficient to estimate reliable indexes for the total of 6,555 possible pairs had all the 115 images been used from the start.
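The pair counts behind this argument can be checked directly. The worked arithmetic below is only a sketch: the binomial counts follow from the image-set sizes given above, while the standard-error formula is the usual one for a proportion estimated from n independent co-appearances, and the values p = 0.5, n = 5 and n = 50 are hypothetical.

```latex
\binom{115}{2} = \frac{115 \cdot 114}{2} = 6555,
\qquad
\binom{30}{2} = \frac{30 \cdot 29}{2} = 435,
\qquad
\frac{2395}{6555} \approx 0.37
\\[4pt]
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}:
\quad
\sqrt{\frac{0.5 \cdot 0.5}{5}} \approx 0.22,
\qquad
\sqrt{\frac{0.5 \cdot 0.5}{50}} \approx 0.07
```

With all 115 images in play, 2,395 matches could touch at most about 37% of the 6,555 pairs even once, whereas restricting the first phase to 30 images leaves only 435 pairs to estimate.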

4.2 Definition of Visual Classes

Once collected, the data regarding the similarity expressed by the survey participants

through the matching of samples and targets had to be analysed. The goal of the anal-

ysis was the creation of visual classes to be used for two different purposes. In the first

place, the creation of classes from the data was aimed at showing the groups defined

by the users on the basis of visual similarity. The visualisation of these groups was

meant to give both an insight into the criteria followed by people while assessing similarity, and an overview of the topological organisation of the groups, which is useful to

understand their relation. This was important especially from the point of view of the

dermatologists as they could evaluate whether people were able to group together simi-

lar lesions without any instruction regarding the similarity criteria to follow. Secondly,

the definition of visual classes was functional to the creation of the training image-set

for an automatic classifier designed to learn the principles used by humans during the

experiment.

Even though the two tasks present several similarities, the peculiar needs of the two

purposes of the classification required the use of two specific techniques.

In order to understand the criteria used to assess visual similarity, visual classes should

be defined in a “soft” way, allowing elements to act as a connection between differ-


ent homogeneous clusters. Moreover, the groups must be displayed together in a way

which makes immediately clear how similar or dissimilar images or groups of images

have been considered. For this task real boundaries between classes are not needed,

but the similarity structure must appear clear at a glance. The best technique to per-

form this kind of analysis is Multi-Dimensional Scaling which will be presented in

section 4.2.1.

The training of a classification system, on the other hand, requires homogeneous and

well separated classes as supervised learning algorithms need the training samples to

be uniquely labelled as belonging to one specific class. While fuzzy classifiers or prob-

abilistic mixture models could have been used to relax this necessity, it was decided to

learn separated classes and cope with intermediate cases through probabilistic classifi-

cation. In order to define the appropriate groups a technique called Spectral Clustering

(presented in section 4.2.2) has been used.

The following part of the chapter will concentrate on explaining these two techniques

and reporting the obtained results.

4.2.1 Multi-Dimensional Scaling

As in many other studies based on perceptual similarity [6, 38, 46] the data gathered

in the evaluation sessions have been used as an input for Multi-Dimensional Scaling

(MDS) [11]. Multi-Dimensional Scaling is a technique used to map the similarity

structure of a given set of elements, as expressed by a distance matrix, to a geometric

configuration sitting in a space with a specified number of dimensions. Being a data

visualisation technique, Multi-Dimensional Scaling is generally used to produce 1D

or 2D plots in which the points, representing the elements whose similarity is known,

are laid out in a configuration reflecting the relations expressed by the distance matrix.

The use of Multi-Dimensional Scaling to produce configurations in higher-dimensional

spaces is rare because of the inherent difficulties in the visualisation. It is nonetheless

possible provided that a suitable solution for inspecting the data is found.

Increasing the number of dimensions of the target space has the effect of allowing new dimensions in which the data can be organised. In practice, by progressively adding new dimensions it is possible to try to capture at a finer level the hierarchy of evaluation criteria which the users followed to generate the similarity structure.

Various formulations of MDS exist [11] and may differ in the way they find the final layout of the points. In general MDS is formulated as an optimisation problem where the cost function to be minimised is a measurement of how much the distance $d_{ij}$ between each pair of points differs from the corresponding element $\delta_{ij}$ of the distance matrix given as an input. The specific formulation depends on the kind of MDS algorithm chosen. Without going through very specific details, the two most important types of MDS are metric and non-metric MDS. The difference lies in the assumptions each of the two models makes on the structure of the distance matrix given as the input. These, in turn, translate into different formulations of the optimisation problem. For our purposes non-metric MDS was chosen, as it is more flexible, imposing fewer restrictions on the structure of the distance matrix. This was necessary as the way this matrix is derived cannot guarantee some of the key properties of metric models. The price to pay for this flexibility is a less informative final layout, where the observed distances cannot be interpreted as real metric distances but rather as ranking information. In other words, given three points A, B and C, if the distance between A and B is lower than that between A and C we can only conclude that B is more similar to A than C is, but the magnitude of the distances does not carry any significant meaning.
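The cost functions alluded to above are usually written as a "stress". The expressions below are the standard textbook forms, given here as a sketch; the text does not state which exact variant was minimised, and the symbols X (the layout of the points) and \hat{d}_{ij} (the disparities) are introduced only for this illustration.

```latex
% Metric MDS: raw stress of a layout X
\sigma(X) = \sum_{i<j} \bigl( d_{ij}(X) - \delta_{ij} \bigr)^2

% Non-metric MDS: Kruskal's stress-1, with disparities \hat{d}_{ij}
% constrained only to preserve the ordering of the \delta_{ij}
\mathrm{Stress}_1(X) =
  \sqrt{ \frac{\sum_{i<j} \bigl( d_{ij}(X) - \hat{d}_{ij} \bigr)^2}
              {\sum_{i<j} d_{ij}(X)^2} },
\qquad
\delta_{ij} < \delta_{kl} \;\Rightarrow\; \hat{d}_{ij} \le \hat{d}_{kl}
```

Because only the ordering of the input distances enters the non-metric form, any monotone rescaling of those distances leaves the solution unchanged, which is why the resulting layout carries ranking information only.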

Before running the MDS algorithm, two preparation steps are necessary: first of all the

collected data must be translated into a similarity matrix, then the similarity matrix has

to be converted into a distance matrix which can be used for MDS.

The similarity matrix S was created on the basis of the recorded matches. Each element $s_{ij}$ of the matrix represents the similarity between lesion i and lesion j and was computed as

$$s_{ij} = \frac{\text{recorded matches between } i \text{ and } j}{\text{total appearances of } i \text{ and } j \text{ in the same screen}}$$
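The ratio above can be computed from the stored screens roughly as follows. This is a minimal sketch rather than the code used in the study: the dictionary layout of a screen, the function name and the example data are hypothetical. It also assumes that a co-appearance is counted only for target/sample pairs, and that pairs which never appeared together are given similarity 0.

```python
from collections import defaultdict


def similarity_matrix(screens, n_images):
    """Estimate s_ij as matches(i, j) / co-appearances(i, j).

    Each screen is a dict with keys 'targets', 'samples' and 'matches',
    where 'matches' is a list of (target, sample) pairs chosen by the user.
    """
    matched = defaultdict(int)
    shown = defaultdict(int)
    for screen in screens:
        # Assumption: a co-appearance is counted for every target/sample
        # pair on a screen, since only those pairs can be matched.
        for t in screen["targets"]:
            for s in screen["samples"]:
                shown[frozenset((t, s))] += 1
        for t, s in screen["matches"]:
            matched[frozenset((t, s))] += 1

    # Pairs never shown together keep the default value 0.0 (an assumption,
    # since the ratio is undefined for them).
    sim = [[0.0] * n_images for _ in range(n_images)]
    for i in range(n_images):
        sim[i][i] = 1.0  # each lesion is taken as maximally similar to itself
    for pair, count in shown.items():
        i, j = tuple(pair)
        sim[i][j] = sim[j][i] = matched[pair] / count
    return sim


# Two hypothetical screens over four images (ids 0..3).
screens = [
    {"targets": [0], "samples": [1, 2, 3], "matches": [(0, 1)]},
    {"targets": [1], "samples": [0, 2, 3], "matches": [(1, 2)]},
]
S = similarity_matrix(screens, 4)
```

In this toy example images 0 and 1 appear together twice and are matched once, so their similarity is 0.5; with so few co-appearances the estimate is very coarse, which is the reliability concern discussed in section 4.1.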

By definition each lesion is considered to be maximally similar to itself, so it is imposed that $s_{ii} = 1$. This is necessary as the elements on the diagonal of the similarity matrix are ignored by the previous formula, since one image cannot appear in a screen both as a target and as a sample. Since Multi-Dimensional Scaling algorithms take distance measurements as an input, the similarity matrix S was converted to a distance matrix D. A standard conversion formula was used, where

$$d_{ij} = \sqrt{s_{ii} + s_{jj} - 2 s_{ij}}$$

It is important to underline that even though in general this conversion guarantees the non-degeneracy property

$$d_{ij} = 0 \iff i = j$$

since, for a set of non-coincident points, only the elements on the diagonal of the similarity matrix have value 1, this does not necessarily hold in our case. Even though it is rare, it can happen that two lesions are matched every single time they appear in the same screen. This would cause their similarity to be 1 and their distance to be 0. In order to avoid this inconvenience, if any off-diagonal element of the similarity matrix equals 1, it is decreased to be slightly lower. This artificial modification of the similarity matrix has the sole effect of avoiding the fusion of multiple points and allows a better structure of the final layout without substantially affecting the similarity structure expressed by the users. It must be said, though, that up to now this procedure has never been necessary since the highest similarity recorded is 0.90.
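The conversion and the safeguard against zero distances can be sketched as follows. The function name and the cap value 0.999 are assumptions made for illustration; the text only says that an off-diagonal similarity of 1 is decreased to be "slightly lower".

```python
import math


def similarity_to_distance(sim, max_offdiag=0.999):
    """Convert similarities into distances d_ij = sqrt(s_ii + s_jj - 2 s_ij)."""
    n = len(sim)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the distance of a lesion from itself stays 0
            # Cap off-diagonal similarities just below 1 so that two
            # distinct lesions are never given distance 0.
            s_ij = min(sim[i][j], max_offdiag)
            dist[i][j] = math.sqrt(sim[i][i] + sim[j][j] - 2.0 * s_ij)
    return dist


# Hypothetical similarity matrix with one off-diagonal entry equal to 1.
S = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
D = similarity_to_distance(S)
```

With a unit diagonal the formula reduces to sqrt(2 − 2·s_ij), so distances range from just above 0 for pairs that were always matched to sqrt(2) for pairs that were never matched.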

While in the exploratory data analysis phase the package ggobi [3] proved extremely

useful for its flexibility, for the final MDS analysis the R function isoMDS was used.

The results obtained for MDS in 1D and 2D can be seen in Figure 4.4 and Figure 4.5

respectively. In both cases MDS was performed on a subset of the complete collec-

tion of images, in particular on the 30 images which were included in the study in

the first place. The main reason behind this choice is that since the purpose of MDS

is understanding the main dimensions evaluated by people when assessing similar-

ity, the similarity measurement fed into the algorithm must be correct. As previously

mentioned, this happens only for pairs of images which have appeared several times

together in the same screen. For this reason, only the 30 images included in the survey

from the beginning constituted an appropriate sample.

Performing MDS in one dimension is equivalent to allowing for only one criterion in the

evaluation of the similarity between the different skin-lesion images. This analysis,

hence, is supposed to find the most important aspect which guides the assessment pro-

cess followed by the observers. Figure 4.4 shows the results of a 1D MDS, where the

horizontal axis represents the only dimension of variation allowed while the vertical

displacement of the elements is introduced merely for visualisation purposes and car-

ries no meaning. In order to avoid confusion, only some of the images have been added

to the graph to make clear why the lesions had been considered similar without having

too much clutter on the plot. Conversely, the id of all the lesions is reported for readers

interested in consulting the DERMOFIT lesions database. For a correct interpretation

of the plots, please note that both in Figure 4.4 and in Figure 4.5 the real configuration

is that of the labels. Pictures have been added where they fitted best, but sometimes

their position might be misleading. Always refer to the labels when making consider-

ations about the layout produced by the MDS algorithm.


Figure 4.4: Results of Multi-Dimensional Scaling with a 1D output layout. The vertical

displacement is added only for visualisation purposes and does not carry any specific

meaning.


Figure 4.5: Results of Multi-Dimensional Scaling with a 2D output layout.

From visual inspection it seems reasonable to assume that the most important similar-

ity criterion used in the evaluation of BCCs is vertical prominence. On the left of the

graph, in fact, lesions are flat and relatively smooth. Moving towards the right-hand side, the elevation of the lesion from the skin level progressively grows, up to the point

where, to the extreme right, the lesions are dome-shaped with a clear rise above the

surrounding skin.

When performing MDS in two dimensions a second evaluation criterion is allowed to

emerge. Figure 4.5 shows the obtained results. Examining the plot, the importance of

the elevation from the skin level is still considerable and is captured by the horizontal

dimension, while the meaning attributed to the vertical axis is less sharply defined.

Apparently, this should be linked to a general concept of smoothness, but it is not clear

whether importance is given to the smoothness of the surface of the lesion per se or to

the presence of scabs or bleeding patches, as in both cases exceptions to the general

rule seem to be present. Regardless of the real meaning of the second dimension, though, the presence of visually homogeneous lesions in each region of the plot and the apparently smooth transition between markedly different groups would suggest that people can effectively perceive similarity between skin lesions. To better understand these findings, more effort should be dedicated to investigating the similarity criteria used. To this end, the web application used to collect data will remain available on-line even after the end of this project, since a wider collection of answers is expected to shed more light on the actual hierarchy of evaluation criteria.

4.2.2 Spectral Clustering

Despite its usefulness for visualising the structure of the data expressed by the simi-

larity matrix, Multi-Dimensional Scaling is not an appropriate technique for creating

the visual classes necessary for the training of an automatic classifier. The reason be-

hind this is that MDS algorithms do not create any group but simply a configuration of

points which is as coherent as possible to the underlying similarity structure. In order

to obtain a set of separate classes, then, a clustering algorithm needed to be used.

While many clustering algorithms have been presented over the years, most of them are not suitable to solve the problem at hand. Nearly all the clustering algorithms, in fact, need as an input the coordinates of the points which have to be divided into groups. As the data collected from the experiments do not provide any form of coordinates, these approaches to clustering could not be followed.

Even though clustering the points obtained through MDS might seem a solution to this problem, this is not the case: after non-metric MDS the distance between the points carries only ranking information, while its magnitude, which would be needed for clustering, is not significant. As most of the studies regarding perceptual similarity (e.g. [38, 46]) are not really aimed at creating classes but rather at understanding the criteria followed by people to decide whether two elements are similar or dissimilar, clustering is generally not required in this kind of research. Clustering, on the other hand, is generally applied when a set of well defined points located in a specific space needs to be divided into groups according to their position. As a result, hardly any documentation can be found concerning how to cluster elements when the only available data is a measurement of their similarity.

The solution to this situation was found in the partial application of a technique known as spectral clustering. Since one of the first steps of any spectral clustering algorithm is the derivation of a similarity matrix starting from the coordinates of the points which are to be grouped, skipping this first step and providing the algorithm directly with the similarity matrix obtained from the matching experiments seemed a principled workaround to obtain the desired clustering of elements (skin-lesion images) which could in no way be interpreted as points of a given space.

Algorithm 1: Spectral Clustering Algorithm by Ng et al. [41]

Task: divide a set of points $S = \{s_1, \ldots, s_n\}$, with $s_i \in \mathbb{R}^l$, into $k$ clusters.

1. Form the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by $A_{ij} = \exp\left(-\|s_i - s_j\|^2 / (2\sigma^2)\right)$ if $i \neq j$ and $A_{ii} = 0$.

2. Define $D$ to be the diagonal matrix whose $(i, i)$-element is the sum of the $i$-th row of matrix $A$.

3. Construct the matrix $L = D^{-1/2} A D^{-1/2}$.

4. Find the $k$ largest eigenvectors of $L$ (those associated with its $k$ largest eigenvalues): $x_1, \ldots, x_k$.

5. Form the $n \times k$ matrix $X = [x_1 \ldots x_k]$ by stacking the found eigenvectors as columns.

6. Form matrix $Y$ from $X$ by renormalising the rows of $X$ to have unit length, i.e. $Y_{ij} = X_{ij} / \left(\sum_j X_{ij}^2\right)^{1/2}$.

7. Treating each row of $Y$ as a point in $\mathbb{R}^k$, use a clustering algorithm, such as k-means, to find $k$ clusters.

8. Replicate the clustering on the original points, assigning $s_i$ to cluster $j$ if and only if the $i$-th row of matrix $Y$ was assigned to cluster $j$.

Among the different perspectives from which spectral clustering can be seen (see von Luxburg [53] for an extended tutorial), the most intuitive one is that based on similarity graphs. It can be shown, in fact, that performing spectral clustering amounts to solving a relaxed version of a partitioning problem on a graph, in which edges between different groups should have low weight and edges within the same group should have high weight. If a graph is built so that each edge connecting two vertices has an associated weight representing the similarity between them, solving the aforementioned partitioning of the graph leads to a clustering of the vertices based on their similarity. Finding this partitioning relies on spectral graph theory (see Chung [19]) and involves the computation of the graph Laplacian.

Figure 4.6: Hierarchical structure of the derived clusters.

A wide variety of spectral clustering algorithms can be found in the literature; the one implemented and used to find the visual classes is that proposed by Ng et al. [41], which is outlined in Algorithm 1. As mentioned before, the data obtained from the experiments cannot be directly represented as points in any particular space, but a similarity matrix can easily be derived from the stored matches between samples and targets. For this reason, spectral clustering techniques can be used to derive visual classes. In particular, in order to apply the algorithm by Ng et al., it is sufficient to substitute the first step with the procedure presented in section 4.2.1 to obtain a similarity matrix (or affinity matrix, as in Algorithm 1), which can then be used to follow the rest of the procedure. A minor intervention is required on the similarity matrix computed from the matches, as many of its elements have value zero while, by construction, the affinity matrix computed in the spectral clustering algorithm does not have any off-diagonal zero elements. To avoid issues during the computations, all the zero elements of the similarity matrix are changed to an arbitrarily small number before beginning to follow the algorithm.
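A minimal sketch of how steps 2 to 7 of Algorithm 1 could be applied to a precomputed similarity matrix for the two-cluster case is given below. It is an illustration under several assumptions, not the implementation used in this work: the function names, the value of `eps`, the zeroed diagonal (taken from step 1 of Algorithm 1) and the example matrix are all hypothetical. To stay free of external libraries, a simple power iteration with deflation stands in for a proper eigen-solver, and a tiny two-centre k-means stands in for a full k-means implementation.

```python
import math


def top_eigenvectors(M, k, iters=500):
    """Approximate the k leading eigenvectors of a symmetric matrix M.

    Power iteration with deflation is run on M + I, so that for a normalised
    affinity matrix (eigenvalues in [-1, 1]) the largest eigenvalues dominate.
    """
    n = len(M)
    B = [[M[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    vectors = []
    for c in range(k):
        # Arbitrary deterministic start; a random start would normally be used.
        v = [math.sin(i + 1.0 + c) for i in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            # Remove the components along the vectors already found.
            for u in vectors:
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        vectors.append(v)
    return vectors


def spectral_bipartition(sim, eps=1e-6):
    """Split the items described by a similarity matrix into two clusters."""
    n = len(sim)
    # Step 1 is replaced by the given similarities: zeros become eps and the
    # diagonal is set to 0 (assumption, following A_ii = 0 in Algorithm 1).
    A = [[0.0 if i == j else max(sim[i][j], eps) for j in range(n)]
         for i in range(n)]
    # Steps 2-3: degrees and L = D^(-1/2) A D^(-1/2).
    deg = [sum(row) for row in A]
    L = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # Steps 4-5: the two leading eigenvectors give the columns of X.
    x1, x2 = top_eigenvectors(L, 2)
    # Step 6: renormalise each row of X to unit length.
    Y = [(a / math.hypot(a, b), b / math.hypot(a, b)) for a, b in zip(x1, x2)]
    # Step 7: two-centre k-means, started from row 0 and the row farthest from it.
    c0 = Y[0]
    c1 = max(Y, key=lambda p: math.dist(p, c0))
    labels = [0] * n
    for _ in range(20):
        labels = [0 if math.dist(p, c0) <= math.dist(p, c1) else 1 for p in Y]
        groups = [[p for p, lab in zip(Y, labels) if lab == g] for g in (0, 1)]
        if not groups[0] or not groups[1]:
            break
        c0, c1 = [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]
    return labels


# Hypothetical similarities: two groups of three images, weakly linked.
S = [
    [1.0, 0.8, 0.8, 0.1, 0.0, 0.0],
    [0.8, 1.0, 0.8, 0.0, 0.1, 0.0],
    [0.8, 0.8, 1.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0, 0.8, 0.8],
    [0.0, 0.1, 0.0, 0.8, 1.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.8, 1.0],
]
labels = spectral_bipartition(S)
```

Applying the same function again to the sub-matrix of one of the two resulting clusters would give a hierarchical, two-way split of the kind summarised in Figure 4.6.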

The mathematical operations performed on the similarity matrix in the first steps of the algorithm could, however, produce unstable results if the matrix exhibits quasi-multicollinearity. Although this is rare, there is no way to guarantee that the similarity matrix is free from it, especially considering how many of its elements are zero. In order to test the stability of the clustering procedure and verify whether this problem should be of major concern, the data were split into two separate samples (including the answers of 22 and 21 people respectively) and clustering was performed independently on each. To obtain better results, a hierarchical approach was chosen, creating 2 clusters at every clustering level. Given this choice, only the first 2 eigenvectors of matrix L had to be retained (see Algorithm 1), so the similarity configuration was effectively projected onto a 2-dimensional space. Clustering was then performed with k-means [51], requesting 2 clusters for each hierarchical run. Given its homogeneity, cluster 1 was never divided further, and the hierarchical procedure focused on cluster 2, for which three more levels were extracted. A schematic outline of the hierarchical structure is provided in Figure 4.6.
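The two-cluster step described above can be sketched in a few lines. The following is a minimal illustration in Python, not the R implementation developed for this work: the function names, the value of `eps`, the use of power iteration to approximate the two leading eigenvectors, the k-means initialisation and the toy similarity matrix are all assumptions made for the example.

```python
import math
import random


def normalised_affinity(similarity, eps=1e-6):
    """Return L = D^(-1/2) A D^(-1/2); zero similarities are replaced by eps."""
    n = len(similarity)
    # Diagonal set to zero as in Ng et al. (an assumption for this sketch).
    a = [[0.0 if i == j else (similarity[i][j] or eps) for j in range(n)]
         for i in range(n)]
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def leading_eigenvectors(matrix, k=2, iterations=500, seed=0):
    """Approximate the k leading eigenvectors of a symmetric matrix.

    Power iteration with deflation is run on (matrix + I) / 2, which has the
    same eigenvectors and non-negative eigenvalues whenever the eigenvalues
    of `matrix` lie in [-1, 1], as they do for the normalised affinity.
    """
    n = len(matrix)
    shifted = [[(matrix[i][j] + (i == j)) / 2 for j in range(n)]
               for i in range(n)]
    rng = random.Random(seed)
    found = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            for u in found:  # keep v orthogonal to the vectors already found
                dot = sum(x * y for x, y in zip(u, v))
                v = [x - dot * y for x, y in zip(v, u)]
            w = [sum(shifted[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        found.append(v)
    return found


def two_means(points, iterations=20):
    """Plain k-means with k = 2; the second centre starts at the point
    farthest from the first one."""
    centres = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def spectral_bipartition(similarity):
    """Split the items described by a similarity matrix into two clusters."""
    x1, x2 = leading_eigenvectors(normalised_affinity(similarity), k=2)
    rows = []
    for a, b in zip(x1, x2):
        norm = math.hypot(a, b)  # rescale each row to unit length
        rows.append((a / norm, b / norm))
    return two_means(rows)


# Hypothetical match counts: items 0-2 are matched together, as are items 3-5.
toy_similarity = [
    [0, 5, 4, 0, 0, 0],
    [5, 0, 6, 0, 0, 0],
    [4, 6, 0, 0, 0, 0],
    [0, 0, 0, 0, 7, 5],
    [0, 0, 0, 7, 0, 6],
    [0, 0, 0, 5, 6, 0],
]
labels = spectral_bipartition(toy_similarity)
print(labels)
```

Under the hierarchical scheme, the next level would be obtained by calling `spectral_bipartition` again on the sub-matrix restricted to the items of the cluster being divided.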

The clusters obtained for the 30 images included in the experiment from the beginning are shown in Figure 4.7 for sample 1 and in Figure 4.8 for sample 2. A comparison shows that at the first hierarchical level the two sets of answers gave the same results, as cluster 1 and cluster 2 contain the same elements in both cases. At the lower levels of the hierarchy, conversely, some differences can be seen. Despite these discrepancies, a consistent overlap can be found between the two cases. Sub-cluster 2.1 of Figure 4.7, for example, is the same as sub-cluster 2.2a in Figure 4.8 with the inclusion of one additional lesion. Similarly, sub-cluster 2.2b of Figure 4.8 has many elements in common with the union of sub-clusters 2.2a and 2.2b of Figure 4.7. Even though the clustering results are not identical beyond the first level of the hierarchy, it is fair to say that the differences are not extreme and, most importantly, that the classes show distinctive traits which support regarding the lesions they contain as visually similar.
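The comparison above was carried out by inspection. One possible way to make such an overlap explicit, shown here only as an illustration and not as part of the analysis performed, is the Jaccard index between pairs of clusters. The lesion identifiers and cluster contents below are hypothetical placeholders, chosen to mimic the case of two clusters differing by one additional lesion.

```python
def jaccard(first, second):
    """Size of the intersection divided by the size of the union."""
    first, second = set(first), set(second)
    return len(first & second) / len(first | second)


def best_matches(clusters_a, clusters_b):
    """For each cluster of A, find the most overlapping cluster of B."""
    result = {}
    for name, members in clusters_a.items():
        scored = [(jaccard(members, other), other_name)
                  for other_name, other in clusters_b.items()]
        score, other_name = max(scored)
        result[name] = (other_name, score)
    return result


# Hypothetical cluster contents (not the clusters of Figures 4.7 and 4.8).
sample_1 = {"2.1": {"L01", "L02", "L03", "L04"}}
sample_2 = {"2.2a": {"L01", "L02", "L03"}, "2.2b": {"L05", "L06"}}

matches = best_matches(sample_1, sample_2)
print(matches)
```

A value of 1 indicates identical clusters, while a cluster that gains one lesion out of four, as in the toy data, scores 0.75.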

Considering the coherence at the highest level of the hierarchy and the relatively small size of the two samples, it can be concluded that the differences observed at lower hierarchical levels are not of major concern and may ultimately disappear with larger samples. For this reason, spectral clustering can be considered stable enough despite the structure of the similarity matrix, and can be used to define the visual classes which will eventually be used in the automatic classification system. The results obtained by running the implemented code on the complete dataset can be seen in Figure 4.9. The first hierarchical level is the same as that obtained on the two partial datasets, while the following levels present some differences from the previous results. Despite that, many commonalities can be observed. Sub-cluster 2.2b of Figure 4.9, for example, is largely preserved in both the previous cases, and groups similar to sub-cluster 2.2a can be found in the other figures as well.

Given the homogeneity shown by the elements of each of the obtained visual classes,

it can be concluded that the intuition of using spectral clustering in this unusual way

was correct and produced the expected outcomes.

As a further test, the results of spectral clustering have been mapped to the output

configuration provided by Multi-Dimensional Scaling (MDS). The result is shown in

Figure 4.10. The colours are used as coding for the different clusters. Red has been

used for the first homogeneous set of lesions found at the first level of the hierarchical


[Figure: lesion images grouped into Cluster 1 and Cluster 2 (Sub-cluster 1, Sub-cluster 2.1, Sub-cluster 2.2a, Sub-cluster 2.2b).]

Figure 4.7: Results of Spectral Clustering (hierarchical with 2 clusters for each level) on sample 1 (22 sets of answers).


[Figure: lesion images grouped into Cluster 1 and Cluster 2 with its sub-clusters.]

Figure 4.8: Results of Spectral Clustering (hierarchical with 2 clusters for each level) on sample 2 (21 sets of answers).


[Figure: lesion images grouped into Cluster 1 and Cluster 2 with its sub-clusters.]

Figure 4.9: Results of Spectral Clustering (hierarchical with 2 clusters for each level) on the complete dataset (43 sets of answers).


[Plot: Non-Metric MDS (2D). Horizontal axis: Coordinate 1; vertical axis: Coordinate 2; both axes range from −0.6 to 0.6. Each of the 30 lesions is plotted as a labelled point.]

Figure 4.10: Results of Spectral Clustering mapped on the output of Multi-Dimensional

Scaling. Red represents cluster 1 while all the other colours are used for sub-clusters

of cluster 2: light blue coding for sub-cluster 1, blue for sub-cluster 2.1, green for sub-

cluster 2.2a and purple for sub-cluster 2.2b.

clustering procedure (cluster 1 in Figure 4.9). All the other colours, on the other hand,

have been used for the single sub-clusters of cluster 2, with light blue coding for sub-

cluster 1, blue for sub-cluster 2.1, green for sub-cluster 2.2a and purple for sub-cluster

2.2b.

It is interesting to note that the dimension represented by the horizontal axis of the plot, which in section 4.2.1 was hypothesised to reflect the vertical prominence of the lesion, seems to be far more important than the other one in defining the visual classes. This can be seen from the fact that each class tends to occupy a characteristic horizontal position while showing a considerable spread in the vertical direction. It is also interesting that successive levels in the hierarchy are obtained by moving progressively towards the right-hand side of the MDS plot. This is not entirely surprising, since more similar clusters, like those found at lower levels of the hierarchy, are expected to lie next to each other in the MDS plot; such a monotonic movement towards the right-hand side is nonetheless rather peculiar. At the time of writing the available data are not sufficient to support further speculation about the reasons behind this phenomenon, but the collection of more answers and the progressive inclusion of new skin-lesion images are expected to provide a better understanding of the dynamics which led to it.

Although several ready-made functions implementing spectral clustering are available, it is often difficult to understand from their documentation which of the many algorithms the code actually implements. Moreover, given the rather unusual use made of spectral clustering, which skips the first part of the algorithm completely, a custom implementation was developed in R (a scripting language for statistical applications). Implementing the code gave better visibility of the matrix operations performed and made it possible to run appropriate tests, such as comparing the clusters obtained on two subsets of the data, to verify whether the structure of the similarity matrix could induce numerical instability and produce potentially unreliable results.

4.3 Discussion

After the presentation of the experiment designed to obtain an estimate of the visual

similarity perceived between different skin-lesion images, and an overview of the web-

application developed to support the data collection, the attention was focused on the

analysis of the gathered data. This analysis had two main purposes. First, the similarity structure emerging from the recorded sample-target matches had to be studied to verify the presence of meaningful evaluation criteria followed by the people who took part in the experiment. Secondly, the lesions had to be partitioned into visual classes characterised by coherent properties, as labelled examples were needed for the supervised training of the automatic classifier presented in the next chapter.

Multi-Dimensional Scaling (MDS) was used to display the similarity structure expressed by the recorded matches and showed two main facts. First of all, the consistency of the obtained configuration, which shows a smooth transition between extremely different images, leads to the conclusion that people can effectively match similar skin-lesion images even without any guideline defining what makes them similar. This is a very important finding as it suggests that the extreme variability of the results obtained in the experiment testing the ABCD rule was not due to an inability of people to give evaluations, but more likely to the rule itself, which seems to guide the assessment in the wrong way. The fact that people could match similar skin lesions supports the idea that the innate human ability to recognise patterns might be more useful for diagnostic purposes than guidelines based on an explicit evaluation of a set of properties, as partly pointed out by Aldridge et al. [6]. This may lead to a new approach to skin-lesion assessment based more on visual examples than on sets of properties. The second important outcome obtained from MDS is evidence that people tend to give very high importance to the vertical prominence of BCCs when evaluating their similarity. Other criteria are surely part of the evaluation, but this property is by far the most important, as underlined by the obtained visual classes. This observation suggests that a good classification system should take into consideration the 3D structure of the lesion, for example through the use of depth data. Due to time constraints, it was not possible to include this type of information in the system proposed in this dissertation. Nonetheless, this addition is an interesting direction for future research, particularly considering that Li et al. [32] had already highlighted the importance of 3D data for skin-lesion analysis.

In order to obtain visual classes, an uncommon use of spectral clustering has been

made, following only part of the algorithm in order to take advantage of the form in

which the similarity data have been collected during the experiments. The results show

the presence of two very stable classes: flat and non-flat BCCs. Given its low internal homogeneity, the class of non-flat BCCs could be divided further into sub-clusters, but these show only relative stability. This issue is believed to be due to the limited amount of data currently available, and is expected to disappear, or at least to reduce considerably, as more data are collected.

Starting from the defined visual classes, a classification system has been developed. Its

description can be found in the next chapter.

Chapter 5

Automatic Classification System

The literature review presented in section 2.2 showed that over the years very different

approaches have been followed to tackle the problem of automatic classification of skin

lesions. One of the aims of this thesis was to propose a classification system capable of solving some of the issues observed in those presented in the past.

As previously mentioned, the presented classifier differs from those currently available, because it tries to reproduce the visual classes defined by people during the similarity assessment experiments. While no attempt to solve this problem has been previously reported, this approach is believed to be promising for obtaining better results in the recognition of Basal Cell Carcinomas (BCCs). This, in turn, would be a step towards the practical use of Computer-Aided Diagnosis tools in the field of dermatology.

Each section of this chapter addresses one specific part of the system development,

starting with conceptual design and implementation, and getting to the evaluation of

the obtained results.

5.1 Structure of the System

When designing the system, some of the results of previous research helped in guiding

the implementation choices.

The first and most problematic aspect of the classification of skin lesions is the automatic segmentation of the images. Given the specific characteristics of skin lesions (e.g. in terms of colour, contrast between lesion and surrounding skin, etc.), segmentation is in general a hard task (see e.g. Capdehourat et al. [15]). Despite these intrinsic difficulties, most of the systems presented so far rely heavily on segmentation to compute the features used for classification (see section 2.2 for more details). The few systems which do not include a segmentation step, e.g. [48], work on patches of images extracted from the internal region of the lesion, possibly missing important aspects of the borders. The first implementation choice made was that of creating a system which would not require any segmentation stage. The advantages of not performing segmentation are considerable, especially considering that BCCs are non-pigmented lesions and hence tend to be very similar in colour to the surrounding skin. Moreover, this represents an unexplored direction of research and it was worth verifying whether it could prove promising. This choice, though, carries the risk of giving more importance to the properties of the normal skin than to those of the lesion itself.

The second important choice was that of trying to mirror the latest trends in object recognition systems in terms of their architecture. In recent years, in fact, more and more attention has been paid to the use of hierarchical architectures in the field of vision. The main concept is that of stacking a series of different layers one on top of the other and performing classification on the basis of the output of the last level. While the classical structure of multi-layer perceptrons can be seen as an instance of this class, recent advances suggested different ways of organising the layers and alternative learning strategies. The systems based on such an architecture are generally grouped under the name of deep learning methods, a category which embraces a wide variety of approaches. Some popular examples include the work by Prof. LeCun and his research group (NYU) on convolutional networks (e.g. by LeCun et al. [29], or by Kavukcuoglu et al. [26]) and some more biologically inspired systems, such as that proposed by Serre et al. [49] in the group of Prof. Poggio (MIT) and the related model by Mutch and Lowe [39].

A full description of deep learning architectures is not needed to motivate the imple-

mentation choices made and only the main concepts will be presented. For an extensive

introduction to the topic see the monograph by Bengio [10].

The fundamental idea behind deep architectures is that of creating a layered system

which computes, at each level, a set of features based on the output of the lower units.

The main difference from the classical multi-layer perceptron approach lies in the fact

that the features of intermediate levels are learned in an unsupervised fashion, and

only the output of the highest level is used to train a classification system in a super-

vised way. Roughly speaking, this approach lets the intermediate features be generated

on the basis of specific traits of the training image-set. As the information is propa-


gated through the architecture, the higher-level features are expected to capture the

co-occurrence patterns of those belonging to the next lower layer, resulting in better

descriptors of the specific set of images under analysis. The output of the top level is

then supposed to be highly informative about the variations typically occurring in the

training image-set and can be used for classification.

Different systems adopt different strategies to learn the intermediate features, but the

main point characterising this kind of technique is that it tries to pave a new way

to object recognition where the features used to discriminate between the classes are

not imposed by the researcher, but emerge autonomously. The descriptors obtained

through this procedure should reflect more closely the low-level properties of the im-

ages which otherwise could be lost. In a way this approach can be seen as moving a

step back and letting the machine learn to see instead of imposing the researcher’s idea

about how and what the machine should see.

For this reason, the idea of implementing a hierarchical system seemed appropriate to solve the task at hand, as automatically learning a low-level representation of the skin-lesion images is expected to give better descriptors than those crafted by humans. This is particularly true in our case since the aim of the system is to replicate classes which have been created by people evaluating visual similarity.

Another characteristic which makes these architectures particularly attractive is that

they do not require any segmentation of the image as reported by LeCun et al. [29]

and Serre et al. [49]. Considering that segmentation is one of the main issues when

dealing with skin-lesion images, systems which do not require it could provide consid-

erable advantages.

As previously mentioned, deep learning is a framework within which very different

implementations can be found. The remaining part of this section will present the im-

plementation choices made when developing the two main parts of the system: the

feature extraction stage and the classification stage.

5.1.1 Feature Extraction Stage

The feature extraction stage is of crucial importance in every hierarchical architecture

as it is the part which is supposed to capture the structure of the presented image.

Many object recognition systems have already been proposed, each having its own feature extractor. The differences are to be found not only in the type of the computed features, but also in the number of levels in the hierarchy and in the way the intermediate layers combine the information coming from lower levels to obtain new features. The systems proposed by Serre et al. [49] and Mutch and Lowe [39], for example, are inspired by the ventral stream of the primate visual cortex. They start, at the lowest level, with features representing responses of image patches to Gabor filters. Non-linear operations are then performed as the level increases, always pooling together features of the next lower layer. The result is a progressive widening of the “receptive field” characterising the feature detectors. A similar approach is followed by LeCun et al. [29] and Kavukcuoglu et al. [26], who focus less on the biological plausibility of their systems and use convolutional networks as the basis of their architectures. In this case the features are learned with energy-based methods, a non-probabilistic learning strategy which grants better performance in situations where a normalised probability measure is not required as an output (see LeCun et al. [28] for a self-contained tutorial).

The system proposed by Kavukcuoglu et al. [26] is of particular interest as it learns a set of features which are pooled in neighbourhoods in such a way as to guarantee some degree of invariance to the transformations observed in the images of the training set. Despite the interesting peculiarities of the system and the good performance claimed by the authors, the architecture has been trained and tested on image sets including either handwritten digits or pictures of objects. The problem with these types of images is that they are extremely different from skin lesions, as they generally show much more structure and often sharper contrast between the object and the background. For this reason it was not possible to guarantee that the system would work comparably well on skin-lesion images. The best alternative was found in the work by Hyvarinen et al. [25], who proposed several statistical models to describe natural images. Even though natural images are not a good approximation of skin-lesion images either, they are much more similar to them than images of objects are, and hence this approach was considered more likely to succeed.

Among the various models proposed in [25], the technique chosen to implement the first layer of the feature extraction stage was topographic Independent Component Analysis (tICA). Even though the statistical details are extremely important, they cannot be included here and the reader is encouraged to refer to [25] for a better understanding of the technical aspects.

The concept behind tICA is to extend standard Independent Component Analysis so that the topological organisation of the visual cortex is reproduced. Basically, a set of feature detectors is organised on a 2D grid and statistical dependency is imposed on the outputs of the detectors belonging to the same neighbourhood (in our case a 5×5 square). Imposing this statistical dependency is conceptually equivalent to forcing nearby detectors to respond to similar features. The features are then learned through a modified version of standard Independent Component Analysis operating on a set of patches extracted from the images of the training set. The result is a set of feature detectors organised on a topological grid according to the similarity of the underlying features. It is important to underline that imposing similarity through statistical dependency of the outputs of the feature detectors does not define any specific concept of similarity. Rather, features which are often active together will be considered similar and hence grouped in the same neighbourhood. The concept of similarity is therefore entirely learned from the images of the training set (see [25] for more details).

Two different strategies have been used to compute the outputs of the feature detectors. In the first, the output $s'_i$ of the i-th feature detector was computed as

\[ s'_i = \sum_{j=1}^{T} w_i^{\top} p_j \]

where $w_i^{\top}$ is the vector obtained by arranging $W_i$ (the matrix storing the coefficients of the i-th feature detector) in a single row, $p_j$ is the j-th patch of the image organised as a column vector, and $T$ is the total number of patches of fixed size which can be extracted from the image. The output $s'_i$ is, in other words, the total activation of the i-th feature detector over the whole image. An alternative formulation has also been used, with $s'_i$ being

\[ s'_i = \max_{j} \left( w_i^{\top} p_j \right) \]

with all the variables having the same meaning as above. In this case $s'_i$ represents the maximum activation level reached by the i-th feature detector on the analysed image. The different results obtained are commented on in section 5.2.
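The two formulations can be illustrated with a small numerical sketch. The code below is a toy example in plain Python, not the implementation used in this work; it assumes that the patches are all the overlapping windows of the detector's size and that detector and patch are flattened in the same row-by-row order. The function names, the 3×3 image and the 2×2 detector are invented for the example.

```python
def extract_patches(image, side):
    """Return every side x side window of a 2D list, flattened row by row."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(side) for dc in range(side)]
        for r in range(rows - side + 1)
        for c in range(cols - side + 1)
    ]


def detector_outputs(detector, image):
    """Return (total activation, maximum activation) of one detector."""
    side = len(detector)
    w = [value for row in detector for value in row]  # W_i as a single row
    responses = [
        sum(wi * pi for wi, pi in zip(w, patch))      # w_i^T p_j
        for patch in extract_patches(image, side)
    ]
    return sum(responses), max(responses)


# Toy data: the four patch responses are -4, 1, 0 and 3.
image = [
    [1, 2, 0],
    [0, 5, 1],
    [3, 0, 2],
]
detector = [
    [1, 0],
    [0, -1],
]
total, peak = detector_outputs(detector, image)
print(total, peak)  # 0 3
```

The example also shows how the two outputs can diverge: positive and negative responses cancel in the sum, while the maximum keeps only the strongest response.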

Even though various authors report an increase in performance as the number of layers increases [39, 49], the inner dynamics of the system quickly become difficult to understand. For this reason it was decided to limit the architecture to 2 layers.

The second layer was designed to pool the output of neighbouring first-layer feature detectors. The pooling was performed on 4×4 squares arranged to partially overlap with each other. The activation $s''_z$ of the z-th element of the second layer is computed as

\[ s''_z = \sqrt{\sum_{p \in p(z)} {s'_p}^{2}} \]

where $p(z)$ represents the set of layer-1 feature detectors belonging to the pooling square of the z-th element of layer 2.

Given the layout of the pooling squares, layer 2 is also configured as a 2D grid, with a side half the length of that of layer 1.
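A minimal sketch of this pooling step is given below. It assumes that the 4×4 squares are placed every 2 positions, which is what halves the side of the grid, and that they wrap around the edges of the layer-1 grid; the wrap-around and the function name are assumptions of this example and are not stated in the text.

```python
import math


def pool_layer(first_layer, size=4, stride=2):
    """Pool a square grid of layer-1 activations into layer-2 activations.

    Each output is the square root of the sum of squared activations in a
    size x size square; squares start every `stride` positions and wrap
    around the grid edges (an assumption of this sketch).
    """
    n = len(first_layer)
    pooled = []
    for r in range(0, n, stride):
        row = []
        for c in range(0, n, stride):
            total = sum(
                first_layer[(r + dr) % n][(c + dc) % n] ** 2
                for dr in range(size)
                for dc in range(size)
            )
            row.append(math.sqrt(total))
        pooled.append(row)
    return pooled


# A 16 x 16 grid of unit activations gives an 8 x 8 grid of values 4.0,
# since each square sums sixteen squared ones.
grid = [[1] * 16 for _ in range(16)]
second_layer = pool_layer(grid)
print(len(second_layer), len(second_layer[0]), second_layer[0][0])
```

With a 16×16 first layer this yields 64 second-layer activations.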

5.1.2 Classification Stage

In order to implement the classification stage, a Gaussian Process (GP) has been preferred over more common approaches based on Support Vector Machines [26, 49]. GPs are sophisticated probabilistic objects and cannot be fully covered in a short paragraph. Only their main features will be presented here; for full coverage please refer to the work by Rasmussen and Williams [43].

There are three main reasons which led to the choice of using a GP.

First of all, GPs are probabilistic techniques and hence perform classification in a prob-

abilistic way. This means that the probability of the different classes is returned and

not just the chosen class. The advantages of obtaining such an output are various and

include

1. the possibility of evaluating intermediate classes (e.g. when the probability of two classes is similar)

2. the possibility of implementing a reject option when classification is not confident

enough

3. the possibility of changing the class-assignment rule without learning the classi-

fier from scratch (e.g. including additional criteria such as a cost or risk function)

The possibility of implementing a reject option is particularly well suited for a medical

setting where everything which cannot be classified with a certain level of confidence

should be marked as suspicious.
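The second and third points above can be made concrete with a small sketch of a decision rule applied to class probabilities after training. The cost values, the rejection cost and the function name are hypothetical and carry no clinical meaning; they only show that the assignment rule can be changed without retraining the classifier.

```python
def decide(probabilities, cost, reject_cost=None):
    """Pick the class with the lowest expected cost, or reject.

    probabilities: mapping from class name to its probability
    cost[true][predicted]: cost of predicting `predicted` when the class
        is really `true`
    reject_cost: if given, the case is rejected whenever even the best
        prediction has an expected cost above this value
    """
    expected = {
        predicted: sum(probabilities[true] * cost[true][predicted]
                       for true in probabilities)
        for predicted in probabilities
    }
    best = min(expected, key=expected.get)
    if reject_cost is not None and expected[best] > reject_cost:
        return "reject"
    return best


probabilities = {"flat": 0.45, "non-flat": 0.55}

# Equal costs for both kinds of mistake: the most probable class wins.
zero_one = {"flat": {"flat": 0, "non-flat": 1},
            "non-flat": {"flat": 1, "non-flat": 0}}
# Hypothetical asymmetric costs: missing a flat lesion is five times worse.
asymmetric = {"flat": {"flat": 0, "non-flat": 5},
              "non-flat": {"flat": 1, "non-flat": 0}}

print(decide(probabilities, zero_one))                   # non-flat
print(decide(probabilities, zero_one, reject_cost=0.3))  # reject
print(decide(probabilities, asymmetric))                 # flat
```

The same probabilities lead to three different outcomes depending only on the rule applied to them.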

Secondly, GPs implement a fully Bayesian approach to inference and hence the learning process is far less exposed to the risk of over-fitting.

Finally, GPs have the ability to perform automatic relevance determination. In other

words, during the training phase they can learn in a principled way which of the input

variables are relevant for classification and which are not. Given the short time avail-

able for experiments this feature was of great help.

The code used for implementing the GP is the one supplied with [43].


Figure 5.1: Features obtained through topographic Independent Component Analysis.

White areas correspond to positive coefficients, black areas to negative ones. Areas in

middle-grey represent coefficients equal to zero.

5.2 Experiments and Evaluation

In order to evaluate the performance of the proposed system, an experimental frame-

work has been set up.

As pointed out in section 4.2.2, only the first level of the hierarchical visual classes

seems to be reliable enough with the available data. For this reason, the classifier

was trained to reproduce the first two visual classes, while learning the other levels of

the hierarchy has been postponed in order to wait for more data. The classes will be

henceforth referred to as the flat and non-flat Basal Cell Carcinomas (BCCs), referring

respectively to cluster 1 and 2 presented in section 4.2.2.

Given the limited number of images classified by people, it was decided to keep 10 images for each class as a training set and use the rest as an independent test set. This approach left only one image as a test case for the flat BCCs, but keeping fewer than 10 training examples for each class would have been inappropriate for obtaining reliable results.

The feature extraction stage was trained running tICA on 10,000 square patches of


Lesion   True Class   P(flat)   P(non-flat)
D670a    flat         0.8917    0.1083
D232a    non-flat     0.1779    0.8221
D254     non-flat     0.1849    0.8151
D472     non-flat     0.9436    0.0564
D489     non-flat     0.7747    0.2253
D642     non-flat     0.2274    0.7726
P102     non-flat     0.4545    0.5455
P140a    non-flat     0.4545    0.5455
P469a    non-flat     0.4542    0.5458
P72      non-flat     0.6536    0.3464

Table 5.1: Results obtained on the image test set. True Class represents the visual class the lesion belongs to, P(flat) and P(non-flat) the probabilities assigned to the two classes by the Gaussian process.

side 40 pixels. The patches were extracted from 200×200-pixel images containing unsegmented skin lesions, sampling with higher probability the central area where the lesions were located. The original colour images were converted to the CIELAB colour space [47] and only the L plane was kept. A total of 256 independent components were requested, generating a topographic structure corresponding to a 16×16 grid with 5×5 square neighbourhoods. The second layer was implemented to pool the output of feature detectors lying in 4×4 partially overlapping squares. The obtained first-layer features are shown in Figure 5.1.
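The centre-biased sampling of patch positions can be sketched as follows. The text does not specify the sampling distribution, so the truncated Gaussian used here, its spread and the function name are purely illustrative assumptions; only the image side (200 pixels) and patch side (40 pixels) come from the description above.

```python
import random


def sample_patch_origins(count, image_side=200, patch_side=40,
                         spread=0.2, seed=0):
    """Draw top-left corners of patches, favouring the centre of the image.

    Coordinates are drawn from a Gaussian centred on the image (standard
    deviation = spread * image_side) and redrawn until the whole patch
    falls inside the image.
    """
    rng = random.Random(seed)
    max_origin = image_side - patch_side
    centre = max_origin / 2
    origins = []
    while len(origins) < count:
        x = round(rng.gauss(centre, spread * image_side))
        y = round(rng.gauss(centre, spread * image_side))
        if 0 <= x <= max_origin and 0 <= y <= max_origin:
            origins.append((x, y))
    return origins


origins = sample_patch_origins(1000)
central = [(x, y) for x, y in origins if 40 <= x <= 120 and 40 <= y <= 120]
print(len(origins), len(central))
```

Uniform sampling would place roughly a quarter of the origins in the central square counted above, so a clearly larger share indicates the intended bias towards the lesion area.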

As for the classification stage, two training scenarios have been tested, providing the GP with different inputs:

1. the square of the activation of the feature detectors, $s'^{2}$ (256 variables)

2. the pooling of the activation of neighbouring detectors, $s''$ (64 variables)

which respectively represent the output of the first and second layer of the feature extraction stage. The squaring in option 1 is motivated by statistical considerations regarding the tICA model; full details can be found in [25].

As mentioned in section 5.1.1, two different formulas have been used to compute the output of the feature detectors. As it turned out, the formulation based on the total activation of the detector on the whole image ($s'_i = \sum_{j=1}^{T} w_i^{\top} p_j$) made it impossible for the GP to learn anything. While the theoretical reasons behind this are not clear, the


(a) D472 (b) D489 (c) P72

(d) P102 (e) P140a (f) P469a

Figure 5.2: Wrong assignments to the flat BCC class (upper row) and low-confidence

correct assignments to the non-flat BCC class (lower row).

most plausible conclusion is that these features are not informative enough for the classifier to be learned. As a result, the GP assigns a probability of 0.5 to each of the classes for every presented input. The same lack of learning occurred when using the alternative formulation ($s'_i = \max_{j} (w_i^{\top} p_j)$) and feeding the GP with the 256-variable vector. When using this latter formulation and passing to the GP the output of the second layer (64-variable vector), on the other hand, learning took place. The results obtained on the test set by the trained system are presented in Table 5.1. If we consider a classification confident only when the probability of one class is over 55%, Table 5.1 reports 3 errors and 3 low-confidence correct cases. To facilitate the interpretation of the results these 6 lesions are shown in Figure 5.2.
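The counts quoted above can be reproduced directly from the probabilities of Table 5.1. The short Python sketch below applies the 55% confidence threshold to each test lesion; the function and variable names are introduced only for this example.

```python
# (lesion, true class, P(flat), P(non-flat)) as reported in Table 5.1
TABLE_5_1 = [
    ("D670a", "flat", 0.8917, 0.1083),
    ("D232a", "non-flat", 0.1779, 0.8221),
    ("D254", "non-flat", 0.1849, 0.8151),
    ("D472", "non-flat", 0.9436, 0.0564),
    ("D489", "non-flat", 0.7747, 0.2253),
    ("D642", "non-flat", 0.2274, 0.7726),
    ("P102", "non-flat", 0.4545, 0.5455),
    ("P140a", "non-flat", 0.4545, 0.5455),
    ("P469a", "non-flat", 0.4542, 0.5458),
    ("P72", "non-flat", 0.6536, 0.3464),
]


def assess(p_flat, p_non_flat, threshold=0.55):
    """Return the most probable class and whether it exceeds the threshold."""
    predicted = "flat" if p_flat > p_non_flat else "non-flat"
    confident = max(p_flat, p_non_flat) > threshold
    return predicted, confident


errors, low_confidence, confident_correct = [], [], []
for lesion, true_class, p_flat, p_non_flat in TABLE_5_1:
    predicted, confident = assess(p_flat, p_non_flat)
    if predicted != true_class:
        errors.append(lesion)
    elif not confident:
        low_confidence.append(lesion)
    else:
        confident_correct.append(lesion)

print("errors:", errors)
print("low-confidence correct:", low_confidence)
print("confident correct:", confident_correct)
```

The three errors are D472, D489 and P72, the three low-confidence correct cases are P102, P140a and P469a, and the remaining four lesions are classified correctly with confidence.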

As pointed out before, the test set is far from ideal. Given the extreme imbalance between flat and non-flat cases, for example, it is impossible to evaluate accurately the performance of classification for flat BCCs. Some general conclusions can be drawn nonetheless.

Despite the limited number of training samples, the system managed to obtain reasonable results for the non-flat BCCs. If we ignore the concept of low-confidence answers, as would happen for non-probabilistic classifiers, only 3 out of 9 non-flat lesions were misclassified. Interestingly, by checking their position in the layout obtained by Multi-Dimensional Scaling (see Figure 4.10 for a clear display) it is easy to note that these three lesions (D472, D489 and P72) are very close to one another. Such proximity means that people have indeed considered them similar, as the system apparently did. Furthermore, by visual inspection of Figure 5.2(b), several traits typical of the flat BCC class can be found in D489. This could partially justify the misclassification.

5.3 Discussion

In this chapter the classification system developed to replicate the visual classes of

Basal Cell Carcinomas created by human observers has been presented. Even though

it would not be fair to claim success on the basis of results obtained on such a limited

amount of test data, the outcomes seem to be promising. As more data are collected, the size of the visual classes will grow and a more thorough data analysis will be possible. Even at this stage, though, the fact that 7 out of the 10 test cases were correctly classified suggests that the system is not merely classifying at random and has possibly learned some relevant traits of the visual classes. Whether

only the non-flat class or both of them were satisfactorily learned is, at the moment,

impossible to evaluate. Interestingly, the misclassified lesions happen to have high

similarity according to human observers, suggesting that the system is indeed able to

find some of the criteria guiding the visual similarity assessment. Whether this is true

or the situation is due to peculiarities of the training and test sets will be the subject for

further research.
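
One way to gauge how much weight 7 correct answers out of 10 can carry is to compute the chance of doing at least that well by guessing. The sketch below assumes a guesser that is right with probability 0.5 on each case, independently of the others. This baseline is an assumption of the illustration, not one used in the dissertation; with a test set as unbalanced as this one, other baselines (for example, always answering non-flat) would behave differently.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """Probability of at least `successes` correct answers in `trials`
    independent attempts, each correct with probability `p`."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Chance that a fair-coin guesser gets 7 or more of the 10 test cases right.
p_at_least_7 = binomial_tail(7, 10, 0.5)
print(f"P(at least 7 of 10 by guessing) = {p_at_least_7:.4f}")  # 0.1719
```

Under this assumption the probability is 176/1024, about 17%, which is consistent with the cautious wording above: the result is encouraging, but a test set of this size cannot rule out chance.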

Chapter 6

Conclusions

The aims of this dissertation were to gain a better understanding of how people assess skin lesions and to explore new ways of using this perceptual information to develop better classification systems. To this end, the project was divided into three successive stages. In the first, the ability of observers to evaluate given properties of skin-lesion images was assessed. The second stage tested how well people could group skin lesions considering only their visual similarity. In this phase no definition of similarity was provided, and hence the observers were forced to look for common traits in the different images. Starting from people’s evaluations, classes of visually similar lesions were derived. This made it possible to assess both the ability of people to identify commonalities between skin-lesion images and the stability of these similarity criteria across different observers. The last phase consisted of developing an automatic system capable of learning the similarity criteria followed by people when defining the visually homogeneous groups.

During the first phase, evidence emerged supporting the hypothesis that people cannot effectively evaluate properties of skin-lesion images even when given fixed visual references. These results seriously call into question dermatological guidelines like the ABCD rule, which assume substantial uniformity in the way people evaluate specific characteristics of the lesions. Furthermore, the correlation patterns that emerged between different properties while analysing the data suggested that automatic classifiers should not be based on the ABCD rule.

The results obtained in the second part of the project showed that, when evaluating skin lesions as a whole (i.e. not according to single properties), observers can effectively assess their visual similarity even when no specific definition of similarity is given. The high inter-observer homogeneity of the assessment suggests that guidelines based on visual samples might provide better results in the classification of skin lesions, especially when the task is performed by laypeople. At the time of writing, reliable results have been obtained for only 30 of the 115 Basal Cell Carcinomas whose images are available. Once additional feedback is obtained from the users, it will be possible to gain a more complete insight into the similarity assessment process. An interesting topic for future research will be unravelling the hierarchy of criteria followed by observers when evaluating the visual similarity of skin lesions. A first step has been taken in this dissertation by showing the importance of vertical prominence, but further research is required for the subsequent hierarchical levels.

Overall, the first two phases of the dissertation were successful and previously unreported conclusions have been drawn. Although far from providing a full understanding of a task as complex as the visual similarity assessment of skin lesions, this dissertation provided new results that offer a good starting point for future work.

In the last phase of the project, the findings regarding visual classes were employed in the development of an automatic classification system designed to replicate the similarity criteria followed by people. The system was designed to learn everything from the data, including the features used to characterise the skin lesions. This approach was aimed at avoiding any bias which could have been introduced by human-crafted descriptors. No system trying to classify skin lesions according to visual classes defined through perceptual experiments has been proposed before, and the one presented in this thesis represents the first attempt to explore this field. Due to the limited amount of data available at the time of writing, though, a thorough evaluation of the classification performance was impossible. The results nonetheless look promising, and new test data are expected to become available in the near future. In its current configuration the proposed system does not take into account any information regarding colour and 3D structure. As these two aspects are of utmost importance in the human visual system, future research should aim to integrate them into the classifier. In particular, integration of depth information is expected to have a considerable impact on the classification performance, since the most important similarity criterion emerging from the Multi-Dimensional Scaling results, which seems to be vertical prominence, can be fully captured only through the use of 3D data.

In conclusion, this dissertation answered most of the questions it set out to address and reported new results which could serve as a basis for future research.

Bibliography

[1] 2005 IPAM graduate summer school: Intelligent extrac-tion of information from graphs and high dimensional data.http://www.ipam.ucla.edu/schedule.aspx?pc=gss2005, Accessed on 14/08/2010.

[2] http://homepages.inf.ed.ac.uk/rbf/dermofit/. Accessed on 16/08/2010.

[3] http://www.ggobi.org. Accessed on 22/08/2010.

[4] http://www.skincarephysicians.com/skincancernet/melanoma.html. Accessed on16/08/2010.

[5] http://www.vision.caltech.edu/image datasets/caltech101/. Accessed on22/08/2010.

[6] R. B. Aldridge, R. Fisher, L. Ballerini, K. Robertson, Y. Bisset, and J. L. Rees.Do laypersons have intrinsic pattern recognition abilities that could be harnessedto allow the accurate and early diagnosis of skin cancers? British Journal ofDermatology, 162(4):949–950, 2010.

[7] G. Argenziano, G. Fabbrocini, P. Carli, V. de Giorgi, E. Sammarco, andM. Delfino. Epiluminescence microscopy for the diagnosis of doubtfulmelanocytic skin lesions: comparison of the abcd rule of dermatoscopy anda new 7-point checklist based on pattern analysis. Archives of Dermatology,134(12):1563–1570, December 1998.

[8] L. Ballerini, X. Li, R. B. Fisher, and J. Rees. A query-by-example content-based image retrieval system of non-melanoma skin lesions. In Proceedings ofMICCAI-09 Workshop MCBR CBS 2009, volume 5853 of LNCS, pages 10–17,2009.

[9] M. S. Bartlett. Properties of sufficiency and statistical tests. In Proceedings ofthe Royal Statistical Society, volume 160 of A, pages 268–282, 1937.

[10] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Ma-chine Learning, 2(1):1–127, 2009.

[11] I. Borg and P. Groenen. Modern Multidimensional Scaling: theory and applica-tions. Springer-Verlag, New York, second edition, 2005.

[12] I. Brace. Questionnaire Design. Kogan Page, second edition, 2008.

[13] R. Brandstrom, M. Hedblad, I. Krakau, and H. Ullen. Laypersons’ perceptualdiscrimination of pigmented skin lesions. Journal of the American Academy ofDermatology, 46(5):667–673, May 2002.

[14] M. B. Brown and A. B. Forsythe. Robust tests for equality of variances. Journalof the American Statistical Association, 69:364–367, 1974.

64

Bibliography 65

[15] G. Capdehourat, A. Corez, A. Bazzano, and P. Muse. Pigmented skin lesions clas-sification using dermatoscopic images. In CIARP 2009, volume 5856 of LNCS,pages 537–544, 2009.

[16] M. E. Celebi, H. Iyatomi, G. Schaefer, and W. V. Stoecker. Lesion border de-tection in dermoscopy images. Computerized Medical Imaging and Graphics,33:148–153, 2009.

[17] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V.Stoecker, and R. H. Moss. A methodological approach to the classification ofdermoscopy images. Computerized Medical Imaging and Graphics, 31:362–373,2007.

[18] M. A. Chaudhry, R. Ashraf, M. N. Jafri, and M. Akbar. Computer aided diag-nosis of skin carcinomas based on textural characteristics. In Proceedings of theInternational Conference on Machine Vision, pages 125–128. IEEE, 2007.

[19] F. Chung. Spectral Graph Theory, volume 92 of CBMS Regional ConferenceSeries in Mathematics. Conference Board of the Mathematical Sciences, Wash-ington, 1997.

[20] D. A. Clausi. An analysis of co-occurrence texture statistics as a function of greylevel quantization. Canadian Journal of Remote Sensing, 28(1):45–62, 2002.

[21] K. Clawson, P. Morrow, B. Scotney, J. McKenna, and O. Dolan. Analysis ofpigmented skin lesion border irregularity using the harmonic wavelet transform.In Proceedings of the 13th International Machine Vision and Image ProcessingConference, pages 18–23. IEEE, 2009.

[22] G. W. Corder and D. I. Foreman. Nonparametric Statistics for non-statisticians.John Wiley and Sons, 2009.

[23] R. J. Friedman, D. S. Rigel, and A. W. Kopf. Early detection of malignantmelanoma: the role of physicians examination and self examination of the skin.CA Cancer Journal for Clinicians, 35:130–151, 1985.

[24] S. Gunasti, M. K. Mulayim, B. Fettahlioglu, A. Yucel, R. Burgut, Y. Sertdemir,and V. L. Aksungur. Interrater agreement in rating of pigmented skin lesions forborder irregularity. Melanoma research, 18:284–288, 2008.

[25] A. Hyvarinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics. Springer-Verlag, 2009.

[26] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun. Learning invariantfeatures through topographic filter maps. In Proceedings of the InternationalConference on Computer Vision and Pattern Recognition (CVPR’09). IEEE,2009.

[27] N. Laskaris. Fuzzy description of skin lesion images. Master’s thesis, School ofInformatics – University of Edinburgh, 2009.

[28] Y. LeCun, S. Chopra, R. Hadsell, M. A. Ranzato, and F. J. Huang. A tutorialon energy-based learning. In G. Bakir, T. Hofman, B. Scholkopf, A. Smola, andB. Taskar, editors, Predicting Structured Data. MIT Press, 2006.

Bibliography 66

[29] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic objectrecognition with invariance to pose and lighting. In roceedings of the Inter-national Conference on Computer Vision and Pattern Recognition (CVPR’04).IEEE Press, 2004.

[30] H. Levene. Robust tests for equality of variances. In I. Olkin, editor, Contri-butions to probability and statistics: essays in honor of Harold Hotelling, pages278–292. Stanford University Press, Stanford, CA, 1960.

[31] M. Li, G. Anzhe, Z. Shaofang, and X. Weidong. Irregularity and asymmetry anal-ysis of skin lesions based on multi-scale local fractal distributions. In Proceedingsof the 2nd International Congress on Image and Signal Processing, pages 1–5.IEEE, 2009.

[32] X. Li, B. Aldridge, L. Ballerini, B. Fisher, and J. Rees. Depth data improvesskin lesion segmentation. In Proceedings of the 12th International Conferenceon Medical Image Computing and Computer-Assisted Intervention, volume 5762of LNCS, pages 1100–1107, 2009.

[33] I. Maglogiannis and C. N. Doukas. Overview of advanced computer vision sys-tems for skin lesions characterization. IEEE Transactions on Information Tech-nology in Biomedicine, 13(5):721–733, September 2009.

[34] S. McDonagh, R. B. Fisher, and J. Rees. Using 3d information for classifica-tion of non-melanoma skin lesions. In Proc. Medical Image Understanding andAnalysis, pages 164–168, Dundee, 2008.

[35] C. S. Mendoza, C. Serrano, and B. Acha. Scale invariant descriptors in patternanalysis of melanocytic lesions. In Proceedings of the 16th IEEE InternationalConference on Image Processing, pages 4193–4196, 2009.

[36] M. Messadi, A. Bessaid, and A. Taleb-Ahmed. Extraction of specific parametersfor skin tumour classification. Journal of Medical Engineering and Technology,33(4):288–295, 2009.

[37] L. J. Meyer, M. Piepkorn, D. E. Goldgar, C. M. Lewis, L. A. Cannon-Albright,J. J. Zone, and M. H. Skolnick. Interobserver concordance in discriminating clin-ical atypia of melanocytic nevi, and correlations with histologic atypia. Journalof the American Academy of Dermatology, 34(4):618–625, April 1996.

[38] A. Mojsilovic, J. Kovacevic, J. Hu, R. J. Sarfranek, and S. K. Ganapathy. Match-ing and retrieval based on vocabulary and grammar of color patterns. IEEE Trans-actions on Image Processing, 9(1):38–54, January 2000.

[39] J. Mutch and D. G. Lowe. Object class recognition and localization using sparsefeatures with limited receptive fields. International Journal of Computer Vision,80(1):45–57, October 2008.

[40] F. Nachbar, W. Stolz, T. Merkle, A. B. Cognetta, T. Vogt, M. Landthaler, P. Bilek,O. Braun-Falco, and G. Plewig. The ABCD rule of dermatoscopy. Journal of theAmerican Academy of Dermatology, 30:551–559, 1994.

[41] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and analgorithm. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances inNeural Information Processing Systems 14, pages 849–856. MIT Press, 2002.

Bibliography 67

[42] S. M. Rajpara, A. P. Botello, J. Townend, and A. D. Ormerod. Systematic reviewof dermoscopy and digital dermoscopy/artificial intelligence for the diagnosis ofmelanoma. British Journal of Dermatology, 161:591–604, 2009.

[43] C.E. Rasmussen and C.K.I Williams. Gaussian Processes for Machine Learning.MIT Press, 2006.

[44] K. Reetz Muller, R. Rangel Bonamigo, T. Antoniolli Crestani, G. Chiaradia, andM. C. Widholzer Rey. Evaluation of patients’ learning about the abcd rule: arandomized study in southern brazil. Anais Brasileiros de Dermatologia, 84(6),November/December 2009.

[45] J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlationcoefficient. The American Statistician, 42(1):59–66, February 1988.

[46] B. E. Rogowitz, T. Frese, J. R. Smith, C. A. Bouman, and E. Kalin. Perceptualimage similarity experiments. In B. E. Rogowitz and Pappas T. N., editors, Hu-man Vision and Electronic Imaging III, Proceedings of the SPIE, 3299, San Jose,CA, January 1998.

[47] J. Schanda, editor. Colorimetry: understanding the CIE system. Wiley-Interscience, 2007.

[48] C. Serrano and B. Acha. Pattern analysis of dermoscopic images based on markovrandom fields. Pattern Recognition, 42:1052–1057, 2009.

[49] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust objectrecognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysisand Machine Intelligence, 29(3):411–426, 2007.

[50] A. She, Y. Liu, and A. Damatoa. Combination of features from skin pattern andabcd analysis for lesion classification. Skin Research and Technology, 13(1):25–33, February 2007.

[51] H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon.Sci, 1:801–804, 1956.

[52] K. Tabatabaie, A. Estek, and P. Toossi. Extraction of skin lesion texture fea-tures based on independent component analysis. Skin Research and Technology,15:433–439, 2009.

[53] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing,17(4):394–416, December 2007.

[54] B. L. Welch. The generalization of student’s problem when several differentpopulation variances are involved. Biometrika, 34(1–2):28–35, 1947.
