Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science
University of Groningen
www.cs.rug.nl/biehl

Prototype-based learning and adaptive distances for classification
overview
Basic concepts of similarity / distance based classification
example system: Learning Vector Quantization (LVQ)
application: Classification of Adrenal Tumors
Distance measures and Relevance Learning
predefined distances, e.g. divergence based LVQ
application: Detection of Cassava Mosaic Disease
adaptive distances, e.g. Matrix Relevance LVQ
application: Classification of Adrenal Tumors (cont’d)
extensions: combined distances, relational data
(excursion: uniqueness and regularization of relevance matrices)
Part I: Basic concepts of distance/similarity based classification
classification problems
- character/digit/speech recognition
- medical diagnoses
- pixel-wise segmentation in image processing
- object recognition/scene analysis
- fault detection in technical systems
- remote sensing
...
machine learning approach:
training phase: extract information from example data,
parameterized in a learning system (neural network, LVQ, SVM, ...)
working phase: application to novel data
here only: supervised learning / classification
distance based classification
assignment of data (objects, observations,...)
to one or several classes (crisp/soft) (categories, labels)
based on comparison with reference data (samples, prototypes)
in terms of a distance measure (dis-similarity, metric)
representation of data (a key step!)
- collection of qualitative/quantitative descriptors
- vectors of numerical features
- sequences, graphs, functional data
- relational data, e.g. in terms of pairwise (dis-) similarities
K-NN classifier
a simple distance-based classifier
- store a set of labeled examples
- classify a query according to the label of the Nearest Neighbor (or the majority of K NN)
- local decision boundaries according to (e.g.) Euclidean distances
- piece-wise linear class borders in feature space, parameterized by all examples
+ conceptually simple, no training required, one parameter (K)
- expensive storage and computation; sensitivity to “outliers” can result in overly complex decision boundaries
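A minimal K-NN sketch in Python may make the scheme concrete (the function and the toy data are ours, not from the talk):

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    """Label a query by majority vote among its k nearest (squared Euclidean) neighbors."""
    d = np.sum((X_train - x_query) ** 2, axis=1)   # distances to all stored examples
    nearest = np.argsort(d)[:k]                    # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority label among the k NN

# every stored example enters every decision: expensive storage and computation
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_classify(X, y, np.array([4.5, 5.2]), k=1))  # -> 1 (nearest neighbor decides)
```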
prototype based classification
a prototype based classifier [Kohonen 1990, 1997]
- represent the data by one or several prototypes per class
- classify a query according to the label of the nearest prototype (or alternative schemes)
- local decision boundaries according to (e.g.) Euclidean distances
- piece-wise linear class borders in feature space, parameterized by prototypes
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- training phase required in order to place prototypes; model selection problem: number of prototypes per class, etc.
Nearest Prototype Classifier

set of prototypes $\{\boldsymbol{w}_j \in \mathbb{R}^N\}$ carrying class labels $S(\boldsymbol{w}_j)$,
based on a dissimilarity/distance measure $d(\boldsymbol{w}, \boldsymbol{x})$

nearest prototype classifier (NPC): given $\boldsymbol{x}$,
- determine the winner $\boldsymbol{w}_L$ with $d(\boldsymbol{w}_L, \boldsymbol{x}) \leq d(\boldsymbol{w}_j, \boldsymbol{x})$ for all $j$
- assign $\boldsymbol{x}$ to the class $S(\boldsymbol{w}_L)$

reasonable requirements: $d(\boldsymbol{w}, \boldsymbol{x}) \geq 0$ and $d(\boldsymbol{w}, \boldsymbol{w}) = 0$

most prominent example: (squared) Euclidean distance $d(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top (\boldsymbol{x} - \boldsymbol{w})$
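The NPC is equally compact in code; a sketch assuming the squared Euclidean distance (names are illustrative):

```python
import numpy as np

def npc_classify(prototypes, proto_labels, x):
    """Assign x to the class of the winning (closest) prototype."""
    d = np.sum((prototypes - x) ** 2, axis=1)   # d(w_j, x) for all prototypes
    winner = np.argmin(d)                       # index L of the winner w_L
    return proto_labels[winner]                 # class S(w_L)
```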
Learning Vector Quantization

∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)

N-dimensional data, feature vectors

heuristic scheme: LVQ1 [Kohonen, 1990, 1997]
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (closest prototype)
• move the winner
- closer towards the data (same class)
- away from the data (different class)
the resulting classifier:
∙ tessellation of feature space [piece-wise linear]
∙ distance-based classification [here: Euclidean distances]
∙ generalization ability: correct classification of new data
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
LVQ1

iterative training procedure:
randomized initial prototypes $\boldsymbol{w}_j$, e.g. close to the class-conditional means

sequential presentation of labelled examples $(\boldsymbol{x}^\mu, S(\boldsymbol{x}^\mu))$
… the winner takes it all — LVQ1 update step:
$\boldsymbol{w}_L \leftarrow \boldsymbol{w}_L + \eta\, \psi\!\left(S(\boldsymbol{w}_L), S(\boldsymbol{x}^\mu)\right) (\boldsymbol{x}^\mu - \boldsymbol{w}_L)$
with learning rate $\eta$ and $\psi = +1$ if the labels coincide, $\psi = -1$ otherwise

many heuristic variants/modifications: [Kohonen, 1990, 1997]
- learning rate schedules $\eta_w(t)$ [Darken & Moody, 1992]
- update more than one prototype per step
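A sketch of the LVQ1 loop as described on the slide (learning rate, number of epochs, and initialization scheme are illustrative choices, not prescriptions from the talk):

```python
import numpy as np

def lvq1_train(X, y, prototypes, proto_labels, eta=0.05, epochs=30, seed=0):
    """Winner-takes-all LVQ1: attract the winner for matching labels, repel otherwise."""
    W = prototypes.copy()
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):            # sequential presentation
            d = np.sum((W - X[i]) ** 2, axis=1)
            L = np.argmin(d)                         # the winner takes it all
            psi = 1.0 if proto_labels[L] == y[i] else -1.0
            W[L] += eta * psi * (X[i] - W[L])        # move closer / away
    return W
```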
LVQ1

LVQ1-like update step for a generalized distance $d(\boldsymbol{w}, \boldsymbol{x})$:
$\boldsymbol{w}_L \leftarrow \boldsymbol{w}_L - \eta\, \psi\!\left(S(\boldsymbol{w}_L), S(\boldsymbol{x}^\mu)\right) \dfrac{\partial d(\boldsymbol{w}_L, \boldsymbol{x}^\mu)}{\partial \boldsymbol{w}_L}$

requirement: the update decreases (increases) the distance if the classes coincide (are different)
Generalized LVQ

one example of cost function based training: GLVQ [Sato & Yamada, 1995]

minimize $E = \sum_{\mu} \Phi(e_\mu)$ with $e_\mu = \dfrac{d(\boldsymbol{w}_J, \boldsymbol{x}^\mu) - d(\boldsymbol{w}_K, \boldsymbol{x}^\mu)}{d(\boldsymbol{w}_J, \boldsymbol{x}^\mu) + d(\boldsymbol{w}_K, \boldsymbol{x}^\mu)}$

two winning prototypes: $\boldsymbol{w}_J$, the closest prototype carrying the same class label as $\boldsymbol{x}^\mu$, and $\boldsymbol{w}_K$, the closest prototype carrying a different label

$\Phi$ sigmoidal (linear for small arguments), e.g. $\Phi(e) = 1/(1 + e^{-\gamma e})$:
E approximates the number of misclassifications
$\Phi$ linear, e.g. $\Phi(e) = e$:
E favors large margin separation of classes
GLVQ

training = optimization with respect to the prototype positions,
e.g. single example presentation, stochastic sequence of examples,
update of the two winning prototypes per step,
based on a non-negative, differentiable distance $d(\boldsymbol{w}, \boldsymbol{x})$
for the (squared) Euclidean distance, the update moves the prototypes towards ($\boldsymbol{w}_J$) / away from ($\boldsymbol{w}_K$) the sample $\boldsymbol{x}^\mu$, with prefactors $\Phi'(e_\mu)\, \partial e_\mu / \partial d_{J,K}$
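A single GLVQ step for the squared Euclidean distance, with Φ taken as the identity for simplicity (the prefactors follow from the chain rule applied to e_μ; names are ours):

```python
import numpy as np

def glvq_step(W, proto_labels, x, y, eta=0.05):
    """Update the two winning prototypes by gradient descent on Phi(e) = e."""
    d = np.sum((W - x) ** 2, axis=1)
    same = proto_labels == y
    J = np.flatnonzero(same)[np.argmin(d[same])]     # closest correct prototype
    K = np.flatnonzero(~same)[np.argmin(d[~same])]   # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ = 2.0 * dK / (dJ + dK) ** 2                   #  d e / d dJ
    gK = 2.0 * dJ / (dJ + dK) ** 2                   # -d e / d dK
    W[J] += eta * gJ * 2.0 * (x - W[J])              # towards the sample
    W[K] -= eta * gK * 2.0 * (x - W[K])              # away from the sample
    return W
```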
prototype/distance based classifiers

+ frequently applied in a variety of practical problems
+ intuitive interpretation: prototypes defined in feature space
+ natural for multi-class problems
+ flexible, easy to implement

- often based on purely heuristic arguments … or …
cost functions with unclear relation to classification error
- model/parameter selection (# of prototypes, learning rate, …)

Important issue: which is the ‘right’ distance measure?

simple Euclidean distance? features may
- scale differently
- be of completely different nature
- be highly correlated / dependent …
related schemes
Many variants of LVQ
intuitive schemes: LVQ2.1, LVQ3, OLVQ, ...
cost function based: RSLVQ (likelihood ratios)
Supervised Neural Gas (NG)
many prototypes, rank based update
Supervised Self-Organizing Maps (SOM)
neighborhood relations, topology preserving mapping
Radial Basis Function Networks (RBF)
hidden units = centers (prototypes) with Gaussian activation
remark: the curse of dimension?

concentration of distances for large N:
“distance based methods are bound to fail in high dimensions” ???

LVQ:
- prototypes are not just random data points
- carefully selected low-noise representatives of the data
- distances of a given data point to prototypes are compared
→ projection to a non-trivial low-dimensional subspace!

see also: models of LVQ training, analytical treatment in the limit of large N;
successful training needs a number of examples proportional to N
[Ghosh et al., 2007; Witoelar et al., 2010]
Questions?
An example problem: classification of adrenal tumors

Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P.M. Stewart, C.H.L. Shackleton et al.
School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe)

Petra Schneider, Han Stiekema, Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen

[Arlt et al., J. Clin. Endocrinology & Metabolism, 2011]
∙ adrenal tumors are common (1-2%) and mostly found incidentally
∙ adrenocortical carcinomas (ACC) account for 2-11% of adrenal incidentalomas (ACA: adrenocortical adenomas)
∙ conventional diagnostic tools lack sensitivity and are labor- and cost-intensive (CT, MRI)
∙ idea: tumor classification based on steroid excretion profile

www.ensat.org
- urinary steroid excretion (24 hours)
- 32 potential biomarkers
- biochemistry imposes correlations, grouping of steroids
data set:
- 102 patients with benign ACA
- 45 patients with malignant ACC

[figure: color-coded excretion values (log scale, relative to healthy controls); axes: steroid marker # vs. ACA/ACC patient #]
Generalized LVQ: training and performance evaluation

∙ data divided into 90% training and 10% test set
∙ employ Euclidean distance measure in the 32-dim. feature space
∙ determine prototypes by stochastic gradient descent: typical profiles (1 per class)
∙ apply classifier to test data, evaluate performance (error rates)
∙ repeat and average over many random splits
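The protocol can be sketched as follows; `train_fn` stands in for any of the trainers above and `npc_classify` reuses the earlier NPC sketch (split fraction and repetition follow the slide, the rest is illustrative):

```python
import numpy as np

def average_test_error(X, y, train_fn, n_splits=100, test_frac=0.1, seed=0):
    """Repeat random 90/10 splits, train on each, average the test error rate."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        W, wl = train_fn(X[train], y[train])                    # prototypes + labels
        pred = np.array([npc_classify(W, wl, x) for x in X[test]])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```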
[figure: prototypes, i.e. typical steroid excretion profiles, for ACA and ACC]
∙ Receiver Operating Characteristic (ROC) [Fawcett, 2000], obtained by introducing a biased NPC with threshold θ:
[figure: ROC curve, false positive rate (1-specificity) vs. true positive rate (sensitivity); the diagonal corresponds to random guessing; θ = 0 marks the unbiased NPC]
Area under Curve (AUC)

extreme settings of the bias:
- all tumors classified as ACA: no false alarms, but no true positives detected
- all tumors classified as ACC: all true positives detected, but max. number of false alarms
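A sketch of the construction: sweep the bias θ over all observed score values and record the resulting rates. The score definition (difference of the two prototype distances) and ACC as the positive class are our assumptions, not a prescription from the slides:

```python
import numpy as np

def roc_from_distances(d_aca, d_acc, y_true):
    """Biased NPC: predict ACC iff d_aca - d_acc > theta; trace (FPR, TPR) over theta."""
    scores = d_aca - d_acc                        # large score = much closer to ACC
    thresholds = np.sort(np.unique(scores))[::-1] # from strict to permissive
    tpr = [np.mean(scores[y_true == 1] > t) for t in thresholds]
    fpr = [np.mean(scores[y_true == 0] > t) for t in thresholds]
    fpr = np.concatenate(([0.0], fpr, [1.0]))     # add the two extreme settings
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    auc = np.trapz(tpr, fpr)                      # area under the curve
    return fpr, tpr, auc
```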
GLVQ performance: ROC characteristics (averaged over splits of the data set), AUC = 0.87
Questions?
Part II: distance measures and relevance learning
distance measures
fixed distance measures:
- select distance measures according to prior knowledge
- data driven choice in a preprocessing step
- determine prototypes for a given distance
- compare performance of various measures
example: divergence based LVQ
Relevance Matrix LVQ [Schneider et al., 2009]

generalized quadratic distance in LVQ:
$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top \Lambda\, (\boldsymbol{x} - \boldsymbol{w})$ with $\Lambda = \Omega^\top \Omega$ positive semi-definite

normalization: $\sum_i \Lambda_{ii} = \sum_{ij} \Omega_{ij}^2 = 1$

variants: one global, several local, or class-wise relevance matrices → piecewise quadratic decision boundaries
diagonal matrices: single feature weights [Bojer et al., 2001] [Hammer et al., 2002]
rectangular $\Omega$: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]
possible constraints: rank control, sparsity, …
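In code, the adaptive distance and its normalization amount to a few lines (a sketch; Ω may be square or rectangular):

```python
import numpy as np

def matrix_distance(omega, w, x):
    """d_Lambda(w, x) = [Omega (x - w)]^2, squared Euclidean in the transformed space."""
    diff = omega @ (x - w)
    return float(diff @ diff)          # non-negative by construction

def normalize_omega(omega):
    """Enforce sum_i Lambda_ii = sum_ij Omega_ij^2 = 1 to fix the overall scale."""
    return omega / np.sqrt(np.sum(omega ** 2))
```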
Relevance Matrix LVQ

optimization of prototypes and distance measure:

Matrix-LVQ1: heuristic WTA (winner-takes-all) scheme, LVQ1-like updates of the winning prototype and of $\Omega$
Relevance Matrix LVQ

optimization of prototypes and distance measure:

Generalized Matrix-LVQ (GMLVQ): descent along the gradients of $E$ with respect to $\boldsymbol{w}_{J,K}$ and $\Omega$
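A single GMLVQ step, extending the GLVQ sketch above by the gradient with respect to Ω (Φ again taken as the identity; the two learning rates are illustrative):

```python
import numpy as np

def gmlvq_step(W, proto_labels, omega, x, y, eta_w=0.05, eta_o=0.005):
    """Joint gradient update of the two winning prototypes and of Omega."""
    lam = omega.T @ omega                                # Lambda = Omega^T Omega
    d = np.array([np.sum((omega @ (x - w)) ** 2) for w in W])
    same = proto_labels == y
    J = np.flatnonzero(same)[np.argmin(d[same])]         # closest correct prototype
    K = np.flatnonzero(~same)[np.argmin(d[~same])]       # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ = 2.0 * dK / (dJ + dK) ** 2                       # chain-rule prefactors
    gK = 2.0 * dJ / (dJ + dK) ** 2
    vJ, vK = x - W[J], x - W[K]
    W[J] += eta_w * gJ * 2.0 * (lam @ vJ)                # attract correct winner
    W[K] -= eta_w * gK * 2.0 * (lam @ vK)                # repel incorrect winner
    omega -= eta_o * 2.0 * (gJ * np.outer(omega @ vJ, vJ)
                            - gK * np.outer(omega @ vK, vK))
    omega /= np.sqrt(np.sum(omega ** 2))                 # re-normalize Omega
    return W, omega
```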
heuristic interpretation

$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = \left[\Omega\,(\boldsymbol{x} - \boldsymbol{w})\right]^2$: standard Euclidean distance for linearly transformed features

the diagonal element $\Lambda_{jj}$ summarizes
- the contribution of the original dimension $j$ to the distance
- the relevance of original feature $j$ for the classification

interpretation assumes implicitly: features have equal order of magnitude,
e.g. after z-score transformation → zero mean and unit variance (averages over the data set)
Relevance Matrix LVQ

optimization of prototype positions and distance measure(s) in one training process (≠ pre-processing)

motivation:
improved performance
- weighting of features and pairs of features
simplified classification schemes
- elimination of non-informative, noisy features
- discriminative low-dimensional representation
insight into the data / classification problem
- identification of most discriminative features
- incorporation of prior knowledge (e.g. structure of Ω)
tumor classification (cont’d)

Generalized Matrix LVQ, ACC vs. ACA classification [Arlt et al., 2011] [Biehl et al., 2012]

∙ data divided into 90% training, 10% test set (z-score transformed)
∙ adaptive generalized quadratic distance measure, parameterized by $\Omega$
∙ determine prototypes: typical profiles (1 per class)
∙ apply classifier to test data, evaluate performance (error rates, ROC)
∙ repeat and average over many random splits
Relevance matrix

[figure: diagonal and off-diagonal elements of the relevance matrix Λ]

fraction of runs (random splits) in which a steroid is rated among the 9 most relevant markers

subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)
[figure: relevance matrix elements and excretion values of steroid 19 in ACA vs. ACC; e.g. steroid 19 is discriminative]
[figure: relevance matrix elements and excretion values of steroid 8 in ACA vs. ACC; non-trivial role: steroid 8 is among the most relevant!]
[figure: scatter plot of the weakly discriminative markers 8 and 12: a highly discriminative combination of markers!]
ROC characteristics: clear improvement due to adaptive distances

[figure: ROC curves, false positive rate (1-specificity) vs. true positive rate (sensitivity)]

- Euclidean (GLVQ): AUC = 0.87
- diagonal relevances (GRLVQ): AUC = 0.93
- full matrix (GMLVQ): AUC = 0.97
observation / theory:

low rank of the resulting relevance matrix; often: a single relevant eigendirection

[figure: eigenvalues of Λ in the ACA/ACC classification]

intrinsic regularization: nominally ~ N×N adaptive parameters in Matrix LVQ reduce to ~ N effective degrees of freedom

low-dimensional representation facilitates, e.g., visualization of labeled data sets

see: Stationarity of Matrix Relevance LVQ [M. Biehl, B. Hammer, F.-M. Schleif, T. Villmann, IJCNN 2015, in press]
visualization of the data set

[figure: ACA and ACC samples, projection on first eigenvector vs. projection on second eigenvector of Λ]

a multi-class example
classification of coffee samples
based on hyperspectral data
(256-dim. feature vectors)
[U. Seiffert et al., IFF Magdeburg]
[figure: prototypes]
related schemes
Relevance LVQ variants
local, rectangular, structured, restricted... relevance matrices
for visualization, functional data, texture recognition, etc.
relevance learning in Robust Soft LVQ, Supervised NG, etc.
combination of distances for mixed data ...
Relevance Learning related schemes in supervised learning ...

RBF Networks [Backhaus et al., 2012]
Neighborhood Component Analysis [Goldberger et al., 2005]
Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010]
Linear Discriminant Analysis (LDA): one prototype per class + global matrix, but a different objective function!
and many more!
links

Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ)
http://matlabserver.cs.rug.nl/gmlvqweb/web/

Pre/re-prints etc.:
http://www.cs.rug.nl/~biehl/

Challenging data sets?
Questions?
uniqueness / regularization

quadratic distance measure (positive semi-definite pseudo-metric):
$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top \Lambda\, (\boldsymbol{x} - \boldsymbol{w})$ with $\Lambda = \Omega^\top \Omega$

intrinsic representation by linear transformation: $d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = \left[\Omega\,(\boldsymbol{x} - \boldsymbol{w})\right]^2$

uniqueness (i): the matrix square root $\Omega$ of $\Lambda$ is not unique*;
canonical representation, e.g. the symmetric $\Omega = \Lambda^{1/2}$

* irrelevant rotations, reflections, symmetries: $\widetilde\Omega = R\,\Omega$ with orthogonal $R$ yields the same $\Lambda$
uniqueness of relevance matrix

uniqueness (ii): given the mapping $\boldsymbol{x} \to \Omega\,\boldsymbol{x}$,
$\widetilde\Omega = \Omega + \hat\Omega$ is possible if $\hat\Omega$ exists with $\hat\Omega\,\boldsymbol{x}^\mu = 0$ for all examples,
i.e. the rows of $\hat\Omega$ lie in the null-space of the data

→ identical mapping of all examples and prototypes,
same distances and classification scheme w.r.t. training data

the data matrix is singular if features are highly correlated, interdependent
uniqueness of relevance matrix: a simple example

consider two identical, entirely irrelevant features, e.g. $x_1 = x_2$ in all samples

their contributions to the distance cancel exactly if $\Lambda_{11} = \Lambda_{22} = -\Lambda_{12}$
(the feature pair is disregarded in the classification)

but a naïve interpretation of the diagonal suggests high relevance!
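A concrete instance (our numbers, chosen only to illustrate the cancellation):

```latex
% take Omega = (1, -1)/sqrt(2), so that
\[
\Lambda = \Omega^{\top}\Omega
        = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix},
\qquad
d_{\Lambda}(\boldsymbol{w},\boldsymbol{x})
  = \tfrac{1}{2}\bigl[(x_1 - w_1) - (x_2 - w_2)\bigr]^{2} .
\]
% With identical features x_1 = x_2 (and prototype components w_1 = w_2),
% the contribution vanishes for every sample: the pair is disregarded.
% Yet the diagonal reads Lambda_11 = Lambda_22 = 1/2, naively suggesting
% two highly relevant features.
```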
posterior null-space projection

the training process yields $\Omega$; determine the data matrix $X$ with its eigenvectors and eigenvalues

column space projection: $\hat\Omega = \Omega\, X X^{+}$, with $X^{+}$ the Moore-Penrose pseudo-inverse of $X$

Note: $\hat\Omega$ minimizes $\lVert \hat\Omega - \Omega \rVert$ under the condition $\hat\Omega\, \boldsymbol{x}^\mu = \Omega\, \boldsymbol{x}^\mu$ for all examples
→ removes null-space contributions
posterior regularization

the training process yields $\Omega$; determine the eigenvectors $\boldsymbol{v}_k$ and eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots$ of the data

regularization: $\hat\Omega = \Omega \sum_{k=1}^{K} \boldsymbol{v}_k \boldsymbol{v}_k^\top$ with $K < \operatorname{rank}(X)$

- retains only the eigenspace corresponding to the $K$ largest eigenvalues
- also removes the eigenspace of (small) non-zero eigenvalues
- potentially improved generalization performance
- smoothens the mapping, less data set specific
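A sketch of the posterior step; using the eigendecomposition of the data correlation matrix is our assumption about the convention, and K = rank(X) recovers the pure column-space projection:

```python
import numpy as np

def regularize_omega(omega, X, K):
    """Restrict Omega to the K leading eigendirections of the data.

    X: (P, N) matrix of the P training samples; omega: (M, N) mapping."""
    C = X.T @ X / len(X)                       # data correlation matrix (N x N)
    eigval, eigvec = np.linalg.eigh(C)         # ascending eigenvalues
    V = eigvec[:, ::-1][:, :K]                 # K leading eigenvectors, shape (N, K)
    return omega @ (V @ V.T)                   # projected mapping Omega_hat
```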
posterior regularization

the regularized mapping can be obtained in two ways:

pre-processing of the data (PCA-like):
- mapped feature space, fixed K
- prototypes yet unknown (*)

after/during training:
- retains original features, flexible K
- can include the prototypes

here: posterior regularization in classification schemes
- dependence of generalization performance on the parameter K
- improved interpretability of the mapping / distance measure

(*) remark: prototypes are (close to) linear combinations of feature vectors when converged
illustrative example

infra-red spectral data: 124 wine samples, 256 wavelengths;
30 training data, 94 test spectra

GMLVQ classification (here: of the binned alcohol content)

[figure: spectra, color-coded by binned alcohol content]

high correlation of features (neighboring channels) and P = 30 →
an effective dimension ≪ 256 can be expected
illustrative example

[figure: test error vs. K; best performance with 7 dimensions remaining; over-fitting effect for larger K; null-space correction leaves P = 30 dimensions]

regularization (beyond column space projection)
- potentially enhances generalization, controls over-fitting
regularization
- enhances generalization
- smoothens relevance profile/matrix
- removes ‘false relevances’
- improves interpretability of Λ

[figure: relevance profile/matrix before and after regularization]
Questions?