Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science
University of Groningen
www.cs.rug.nl/biehl

Prototype-based learning and adaptive distances for classification
overview
Basic concepts of similarity / distance based classification
example system: Learning Vector Quantization (LVQ)
application: Classification of Adrenal Tumors
Distance measures and Relevance Learning
predefined distances, e.g. divergence based LVQ
application: Detection of Cassava Mosaic Disease
adaptive distances, e.g. Matrix Relevance LVQ
application: Classification of Adrenal Tumors (cont’d)
extensions: combined distances, relational data
(excursion: uniqueness and regularization of relevance matrices)
Part I: Basic concepts of distance/similarity based classification
classification problems
- character/digit/speech recognition
- medical diagnoses
- pixel-wise segmentation in image processing
- object recognition/scene analysis
- fault detection in technical systems
- remote sensing
...
machine learning approach:
training phase: extract information from example data,
parameterized in a learning system (neural network, LVQ, SVM, ...)
working phase: application to novel data
here only: supervised learning / classification
distance based classification
assignment of data (objects, observations,...)
to one or several classes (crisp/soft) (categories, labels)
based on comparison with reference data (samples, prototypes)
in terms of a distance measure (dis-similarity, metric)
representation of data (a key step!)
- collection of qualitative/quantitative descriptors
- vectors of numerical features
- sequences, graphs, functional data
- relational data, e.g. in terms of pairwise (dis-) similarities
K-NN classifier
a simple distance-based classifier
- store a set of labeled examples
- classify a query according to the label of the Nearest Neighbor (or the majority of K NN)
- local decision boundaries according to (e.g.) Euclidean distances
- piece-wise linear class borders in feature space, parameterized by all examples
+ conceptually simple, no training required, one parameter (K)
- expensive storage and computation; sensitivity to “outliers” can result in overly complex decision boundaries
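A minimal K-NN sketch in Python may make the scheme concrete (the function and the toy data are ours, not from the talk):

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k=3):
    """Label a query by majority vote among its k nearest (squared Euclidean) neighbors."""
    d = np.sum((X_train - x_query) ** 2, axis=1)   # distances to all stored examples
    nearest = np.argsort(d)[:k]                    # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority label among the k NN

# every stored example enters every decision: expensive storage and computation
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_classify(X, y, np.array([4.5, 5.2]), k=1))  # -> 1 (nearest neighbor decides)
```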
prototype based classification
a prototype based classifier [Kohonen 1990, 1997]
- represent the data by one or several prototypes per class
- classify a query according to the label of the nearest prototype (or alternative schemes)
- local decision boundaries according to (e.g.) Euclidean distances
- piece-wise linear class borders in feature space, parameterized by prototypes
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- training phase required in order to place prototypes; model selection problem: number of prototypes per class, etc.
Nearest Prototype Classifier

set of prototypes $\{\boldsymbol{w}_j \in \mathbb{R}^N\}$ carrying class labels $S(\boldsymbol{w}_j)$,
based on a dissimilarity/distance measure $d(\boldsymbol{w}, \boldsymbol{x})$

nearest prototype classifier (NPC): given $\boldsymbol{x}$,
- determine the winner $\boldsymbol{w}_L$ with $d(\boldsymbol{w}_L, \boldsymbol{x}) \leq d(\boldsymbol{w}_j, \boldsymbol{x})$ for all $j$
- assign $\boldsymbol{x}$ to the class $S(\boldsymbol{w}_L)$

reasonable requirements: $d(\boldsymbol{w}, \boldsymbol{x}) \geq 0$ and $d(\boldsymbol{w}, \boldsymbol{w}) = 0$

most prominent example: (squared) Euclidean distance $d(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top (\boldsymbol{x} - \boldsymbol{w})$
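The NPC is equally compact in code; a sketch assuming the squared Euclidean distance (names are illustrative):

```python
import numpy as np

def npc_classify(prototypes, proto_labels, x):
    """Assign x to the class of the winning (closest) prototype."""
    d = np.sum((prototypes - x) ** 2, axis=1)   # d(w_j, x) for all prototypes
    winner = np.argmin(d)                       # index L of the winner w_L
    return proto_labels[winner]                 # class S(w_L)
```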
Learning Vector Quantization

∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)

N-dimensional data, feature vectors

heuristic scheme: LVQ1 [Kohonen, 1990, 1997]
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (closest prototype)
• move the winner
- closer towards the data (same class)
- away from the data (different class)
the resulting classifier:
∙ tessellation of feature space [piece-wise linear]
∙ distance-based classification [here: Euclidean distances]
∙ generalization ability: correct classification of new data
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
LVQ1

iterative training procedure:
randomized initial prototypes $\boldsymbol{w}_j$, e.g. close to the class-conditional means

sequential presentation of labelled examples $(\boldsymbol{x}^\mu, S(\boldsymbol{x}^\mu))$
… the winner takes it all — LVQ1 update step:
$\boldsymbol{w}_L \leftarrow \boldsymbol{w}_L + \eta\, \psi\!\left(S(\boldsymbol{w}_L), S(\boldsymbol{x}^\mu)\right) (\boldsymbol{x}^\mu - \boldsymbol{w}_L)$
with learning rate $\eta$ and $\psi = +1$ if the labels coincide, $\psi = -1$ otherwise

many heuristic variants/modifications: [Kohonen, 1990, 1997]
- learning rate schedules $\eta_w(t)$ [Darken & Moody, 1992]
- update more than one prototype per step
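A sketch of the LVQ1 loop as described on the slide (learning rate, number of epochs, and initialization scheme are illustrative choices, not prescriptions from the talk):

```python
import numpy as np

def lvq1_train(X, y, prototypes, proto_labels, eta=0.05, epochs=30, seed=0):
    """Winner-takes-all LVQ1: attract the winner for matching labels, repel otherwise."""
    W = prototypes.copy()
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):            # sequential presentation
            d = np.sum((W - X[i]) ** 2, axis=1)
            L = np.argmin(d)                         # the winner takes it all
            psi = 1.0 if proto_labels[L] == y[i] else -1.0
            W[L] += eta * psi * (X[i] - W[L])        # move closer / away
    return W
```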
LVQ1

LVQ1-like update step for a generalized distance $d(\boldsymbol{w}, \boldsymbol{x})$:
$\boldsymbol{w}_L \leftarrow \boldsymbol{w}_L - \eta\, \psi\!\left(S(\boldsymbol{w}_L), S(\boldsymbol{x}^\mu)\right) \dfrac{\partial d(\boldsymbol{w}_L, \boldsymbol{x}^\mu)}{\partial \boldsymbol{w}_L}$

requirement: the update decreases (increases) the distance if the classes coincide (are different)
Generalized LVQ

one example of cost function based training: GLVQ [Sato & Yamada, 1995]

minimize $E = \sum_{\mu} \Phi(e_\mu)$ with $e_\mu = \dfrac{d(\boldsymbol{w}_J, \boldsymbol{x}^\mu) - d(\boldsymbol{w}_K, \boldsymbol{x}^\mu)}{d(\boldsymbol{w}_J, \boldsymbol{x}^\mu) + d(\boldsymbol{w}_K, \boldsymbol{x}^\mu)}$

two winning prototypes: $\boldsymbol{w}_J$, the closest prototype carrying the same class label as $\boldsymbol{x}^\mu$, and $\boldsymbol{w}_K$, the closest prototype carrying a different label

$\Phi$ sigmoidal (linear for small arguments), e.g. $\Phi(e) = 1/(1 + e^{-\gamma e})$:
E approximates the number of misclassifications
$\Phi$ linear, e.g. $\Phi(e) = e$:
E favors large margin separation of classes
GLVQ

training = optimization with respect to the prototype positions,
e.g. single example presentation, stochastic sequence of examples,
update of the two winning prototypes per step,
based on a non-negative, differentiable distance $d(\boldsymbol{w}, \boldsymbol{x})$
for the (squared) Euclidean distance, the update moves the prototypes towards ($\boldsymbol{w}_J$) / away from ($\boldsymbol{w}_K$) the sample $\boldsymbol{x}^\mu$, with prefactors $\Phi'(e_\mu)\, \partial e_\mu / \partial d_{J,K}$
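A single GLVQ step for the squared Euclidean distance, with Φ taken as the identity for simplicity (the prefactors follow from the chain rule applied to e_μ; names are ours):

```python
import numpy as np

def glvq_step(W, proto_labels, x, y, eta=0.05):
    """Update the two winning prototypes by gradient descent on Phi(e) = e."""
    d = np.sum((W - x) ** 2, axis=1)
    same = proto_labels == y
    J = np.flatnonzero(same)[np.argmin(d[same])]     # closest correct prototype
    K = np.flatnonzero(~same)[np.argmin(d[~same])]   # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ = 2.0 * dK / (dJ + dK) ** 2                   #  d e / d dJ
    gK = 2.0 * dJ / (dJ + dK) ** 2                   # -d e / d dK
    W[J] += eta * gJ * 2.0 * (x - W[J])              # towards the sample
    W[K] -= eta * gK * 2.0 * (x - W[K])              # away from the sample
    return W
```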
prototype/distance based classifiers

+ frequently applied in a variety of practical problems
+ intuitive interpretation: prototypes defined in feature space
+ natural for multi-class problems
+ flexible, easy to implement

- often based on purely heuristic arguments … or …
cost functions with unclear relation to classification error
- model/parameter selection (# of prototypes, learning rate, …)

Important issue: which is the ‘right’ distance measure?

simple Euclidean distance? features may
- scale differently
- be of completely different nature
- be highly correlated / dependent …
related schemes
Many variants of LVQ
intuitive schemes: LVQ2.1, LVQ3, OLVQ, ...
cost function based: RSLVQ (likelihood ratios)
Supervised Neural Gas (NG)
many prototypes, rank based update
Supervised Self-Organizing Maps (SOM)
neighborhood relations, topology preserving mapping
Radial Basis Function Networks (RBF)
hidden units = centers (prototypes) with Gaussian activation
remark: the curse of dimension?

concentration of distances for large N:
“distance based methods are bound to fail in high dimensions” ???

LVQ:
- prototypes are not just random data points
- carefully selected low-noise representatives of the data
- distances of a given data point to prototypes are compared
→ projection to a non-trivial low-dimensional subspace!

see also: models of LVQ training, analytical treatment in the limit of large N;
successful training needs a number of examples proportional to N
[Ghosh et al., 2007; Witoelar et al., 2010]
Questions?
An example problem: classification of adrenal tumors

Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P.M. Stewart, C.H.L. Shackleton et al.
School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe)

Petra Schneider, Han Stiekema, Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen

[Arlt et al., J. Clin. Endocrinology & Metabolism, 2011]
∙ adrenal tumors are common (1-2%) and mostly found incidentally
∙ adrenocortical carcinomas (ACC) account for 2-11% of adrenal incidentalomas (ACA: adrenocortical adenomas)
∙ conventional diagnostic tools lack sensitivity and are labor- and cost-intensive (CT, MRI)
∙ idea: tumor classification based on steroid excretion profile

www.ensat.org
- urinary steroid excretion (24 hours)
- 32 potential biomarkers
- biochemistry imposes correlations, grouping of steroids
data set:
- 102 patients with benign ACA
- 45 patients with malignant ACC

[figure: color-coded excretion values (log scale, relative to healthy controls); axes: steroid marker # vs. ACA/ACC patient #]
Generalized LVQ: training and performance evaluation

∙ data divided into 90% training and 10% test set
∙ employ Euclidean distance measure in the 32-dim. feature space
∙ determine prototypes by stochastic gradient descent: typical profiles (1 per class)
∙ apply classifier to test data, evaluate performance (error rates)
∙ repeat and average over many random splits
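The protocol can be sketched as follows; `train_fn` stands in for any of the trainers above and `npc_classify` reuses the earlier NPC sketch (split fraction and repetition follow the slide, the rest is illustrative):

```python
import numpy as np

def average_test_error(X, y, train_fn, n_splits=100, test_frac=0.1, seed=0):
    """Repeat random 90/10 splits, train on each, average the test error rate."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        W, wl = train_fn(X[train], y[train])                    # prototypes + labels
        pred = np.array([npc_classify(W, wl, x) for x in X[test]])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))
```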
[figure: prototypes, i.e. typical steroid excretion profiles, for ACA and ACC]
∙ Receiver Operating Characteristic (ROC) [Fawcett, 2000], obtained by introducing a biased NPC with threshold θ:
[figure: ROC curve, false positive rate (1-specificity) vs. true positive rate (sensitivity); the diagonal corresponds to random guessing; θ = 0 marks the unbiased NPC]
Area under Curve (AUC)

extreme settings of the bias:
- all tumors classified as ACA: no false alarms, but no true positives detected
- all tumors classified as ACC: all true positives detected, but max. number of false alarms
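A sketch of the construction: sweep the bias θ over all observed score values and record the resulting rates. The score definition (difference of the two prototype distances) and ACC as the positive class are our assumptions, not a prescription from the slides:

```python
import numpy as np

def roc_from_distances(d_aca, d_acc, y_true):
    """Biased NPC: predict ACC iff d_aca - d_acc > theta; trace (FPR, TPR) over theta."""
    scores = d_aca - d_acc                        # large score = much closer to ACC
    thresholds = np.sort(np.unique(scores))[::-1] # from strict to permissive
    tpr = [np.mean(scores[y_true == 1] > t) for t in thresholds]
    fpr = [np.mean(scores[y_true == 0] > t) for t in thresholds]
    fpr = np.concatenate(([0.0], fpr, [1.0]))     # add the two extreme settings
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    auc = np.trapz(tpr, fpr)                      # area under the curve
    return fpr, tpr, auc
```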
GLVQ performance: ROC characteristics (averaged over splits of the data set), AUC = 0.87
Questions?
Part II: distance measures and relevance learning
distance measures
fixed distance measures:
- select distance measures according to prior knowledge
- data driven choice in a preprocessing step
- determine prototypes for a given distance
- compare performance of various measures
example: divergence based LVQ
Relevance Matrix LVQ [Schneider et al., 2009]

generalized quadratic distance in LVQ:
$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top \Lambda\, (\boldsymbol{x} - \boldsymbol{w})$ with $\Lambda = \Omega^\top \Omega$ positive semi-definite

normalization: $\sum_i \Lambda_{ii} = \sum_{ij} \Omega_{ij}^2 = 1$

variants: one global, several local, or class-wise relevance matrices → piecewise quadratic decision boundaries
diagonal matrices: single feature weights [Bojer et al., 2001] [Hammer et al., 2002]
rectangular $\Omega$: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]
possible constraints: rank control, sparsity, …
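In code, the adaptive distance and its normalization amount to a few lines (a sketch; Ω may be square or rectangular):

```python
import numpy as np

def matrix_distance(omega, w, x):
    """d_Lambda(w, x) = [Omega (x - w)]^2, squared Euclidean in the transformed space."""
    diff = omega @ (x - w)
    return float(diff @ diff)          # non-negative by construction

def normalize_omega(omega):
    """Enforce sum_i Lambda_ii = sum_ij Omega_ij^2 = 1 to fix the overall scale."""
    return omega / np.sqrt(np.sum(omega ** 2))
```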
Relevance Matrix LVQ

optimization of prototypes and distance measure:

Matrix-LVQ1: heuristic WTA (winner-takes-all) scheme, LVQ1-like updates of the winning prototype and of $\Omega$
Relevance Matrix LVQ

optimization of prototypes and distance measure:

Generalized Matrix-LVQ (GMLVQ): descent along the gradients of $E$ with respect to $\boldsymbol{w}_{J,K}$ and $\Omega$
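A single GMLVQ step, extending the GLVQ sketch above by the gradient with respect to Ω (Φ again taken as the identity; the two learning rates are illustrative):

```python
import numpy as np

def gmlvq_step(W, proto_labels, omega, x, y, eta_w=0.05, eta_o=0.005):
    """Joint gradient update of the two winning prototypes and of Omega."""
    lam = omega.T @ omega                                # Lambda = Omega^T Omega
    d = np.array([np.sum((omega @ (x - w)) ** 2) for w in W])
    same = proto_labels == y
    J = np.flatnonzero(same)[np.argmin(d[same])]         # closest correct prototype
    K = np.flatnonzero(~same)[np.argmin(d[~same])]       # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ = 2.0 * dK / (dJ + dK) ** 2                       # chain-rule prefactors
    gK = 2.0 * dJ / (dJ + dK) ** 2
    vJ, vK = x - W[J], x - W[K]
    W[J] += eta_w * gJ * 2.0 * (lam @ vJ)                # attract correct winner
    W[K] -= eta_w * gK * 2.0 * (lam @ vK)                # repel incorrect winner
    omega -= eta_o * 2.0 * (gJ * np.outer(omega @ vJ, vJ)
                            - gK * np.outer(omega @ vK, vK))
    omega /= np.sqrt(np.sum(omega ** 2))                 # re-normalize Omega
    return W, omega
```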
heuristic interpretation

$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = \left[\Omega\,(\boldsymbol{x} - \boldsymbol{w})\right]^2$: standard Euclidean distance for linearly transformed features

the diagonal element $\Lambda_{jj}$ summarizes
- the contribution of the original dimension $j$ to the distance
- the relevance of original feature $j$ for the classification

interpretation assumes implicitly: features have equal order of magnitude,
e.g. after z-score transformation → zero mean and unit variance (averages over the data set)
Relevance Matrix LVQ

optimization of prototype positions and distance measure(s) in one training process (≠ pre-processing)

motivation:
improved performance
- weighting of features and pairs of features
simplified classification schemes
- elimination of non-informative, noisy features
- discriminative low-dimensional representation
insight into the data / classification problem
- identification of most discriminative features
- incorporation of prior knowledge (e.g. structure of Ω)
tumor classification (cont’d)

Generalized Matrix LVQ, ACC vs. ACA classification [Arlt et al., 2011] [Biehl et al., 2012]

∙ data divided into 90% training, 10% test set (z-score transformed)
∙ adaptive generalized quadratic distance measure, parameterized by $\Omega$
∙ determine prototypes: typical profiles (1 per class)
∙ apply classifier to test data, evaluate performance (error rates, ROC)
∙ repeat and average over many random splits
Relevance matrix

[figure: diagonal and off-diagonal elements of the relevance matrix Λ]

fraction of runs (random splits) in which a steroid is rated among the 9 most relevant markers

subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)
[figure: relevance matrix elements and excretion values of steroid 19 in ACA vs. ACC; e.g. steroid 19 is discriminative]
[figure: relevance matrix elements and excretion values of steroid 8 in ACA vs. ACC; non-trivial role: steroid 8 is among the most relevant!]
[figure: scatter plot of the weakly discriminative markers 8 and 12: a highly discriminative combination of markers!]
ROC characteristics: clear improvement due to adaptive distances

[figure: ROC curves, false positive rate (1-specificity) vs. true positive rate (sensitivity)]

- Euclidean (GLVQ): AUC = 0.87
- diagonal relevances (GRLVQ): AUC = 0.93
- full matrix (GMLVQ): AUC = 0.97
observation / theory:

low rank of the resulting relevance matrix; often: a single relevant eigendirection

[figure: eigenvalues of Λ in the ACA/ACC classification]

intrinsic regularization: nominally ~ N×N adaptive parameters in Matrix LVQ reduce to ~ N effective degrees of freedom

low-dimensional representation facilitates, e.g., visualization of labeled data sets

see: Stationarity of Matrix Relevance LVQ [M. Biehl, B. Hammer, F.-M. Schleif, T. Villmann, IJCNN 2015, in press]
visualization of the data set

[figure: ACA and ACC samples, projection on first eigenvector vs. projection on second eigenvector of Λ]

a multi-class example
classification of coffee samples
based on hyperspectral data
(256-dim. feature vectors)
[U. Seiffert et al., IFF Magdeburg]
[figure: prototypes]
related schemes
Relevance LVQ variants
local, rectangular, structured, restricted... relevance matrices
for visualization, functional data, texture recognition, etc.
relevance learning in Robust Soft LVQ, Supervised NG, etc.
combination of distances for mixed data ...
Relevance Learning related schemes in supervised learning ...

RBF Networks [Backhaus et al., 2012]
Neighborhood Component Analysis [Goldberger et al., 2005]
Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010]
Linear Discriminant Analysis (LDA): one prototype per class + global matrix, but a different objective function!
and many more!
links

Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ)
http://matlabserver.cs.rug.nl/gmlvqweb/web/

Pre/re-prints etc.:
http://www.cs.rug.nl/~biehl/

Challenging data sets?
Questions?
uniqueness / regularization

quadratic distance measure (positive semi-definite pseudo-metric):
$d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = (\boldsymbol{x} - \boldsymbol{w})^\top \Lambda\, (\boldsymbol{x} - \boldsymbol{w})$ with $\Lambda = \Omega^\top \Omega$

intrinsic representation by linear transformation: $d_\Lambda(\boldsymbol{w}, \boldsymbol{x}) = \left[\Omega\,(\boldsymbol{x} - \boldsymbol{w})\right]^2$

uniqueness (i): the matrix square root $\Omega$ of $\Lambda$ is not unique*;
canonical representation, e.g. the symmetric $\Omega = \Lambda^{1/2}$

* irrelevant rotations, reflections, symmetries: $\widetilde\Omega = R\,\Omega$ with orthogonal $R$ yields the same $\Lambda$
uniqueness of relevance matrix

uniqueness (ii): given the mapping $\boldsymbol{x} \to \Omega\,\boldsymbol{x}$,
$\widetilde\Omega = \Omega + \hat\Omega$ is possible if $\hat\Omega$ exists with $\hat\Omega\,\boldsymbol{x}^\mu = 0$ for all examples,
i.e. the rows of $\hat\Omega$ lie in the null-space of the data

→ identical mapping of all examples and prototypes,
same distances and classification scheme w.r.t. training data

the data matrix is singular if features are highly correlated, interdependent
uniqueness of relevance matrix: a simple example

consider two identical, entirely irrelevant features, e.g. $x_1 = x_2$ in all samples

their contributions to the distance cancel exactly if $\Lambda_{11} = \Lambda_{22} = -\Lambda_{12}$
(the feature pair is disregarded in the classification)

but a naïve interpretation of the diagonal suggests high relevance!
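A concrete instance (our numbers, chosen only to illustrate the cancellation):

```latex
% take Omega = (1, -1)/sqrt(2), so that
\[
\Lambda = \Omega^{\top}\Omega
        = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix},
\qquad
d_{\Lambda}(\boldsymbol{w},\boldsymbol{x})
  = \tfrac{1}{2}\bigl[(x_1 - w_1) - (x_2 - w_2)\bigr]^{2} .
\]
% With identical features x_1 = x_2 (and prototype components w_1 = w_2),
% the contribution vanishes for every sample: the pair is disregarded.
% Yet the diagonal reads Lambda_11 = Lambda_22 = 1/2, naively suggesting
% two highly relevant features.
```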
posterior null-space projection

the training process yields $\Omega$; determine the data matrix $X$ with its eigenvectors and eigenvalues

column space projection: $\hat\Omega = \Omega\, X X^{+}$, with $X^{+}$ the Moore-Penrose pseudo-inverse of $X$

Note: $\hat\Omega$ minimizes $\lVert \hat\Omega - \Omega \rVert$ under the condition $\hat\Omega\, \boldsymbol{x}^\mu = \Omega\, \boldsymbol{x}^\mu$ for all examples
→ removes null-space contributions
posterior regularization

the training process yields $\Omega$; determine the eigenvectors $\boldsymbol{v}_k$ and eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots$ of the data

regularization: $\hat\Omega = \Omega \sum_{k=1}^{K} \boldsymbol{v}_k \boldsymbol{v}_k^\top$ with $K < \operatorname{rank}(X)$

- retains only the eigenspace corresponding to the $K$ largest eigenvalues
- also removes the eigenspace of (small) non-zero eigenvalues
- potentially improved generalization performance
- smoothens the mapping, less data set specific
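A sketch of the posterior step; using the eigendecomposition of the data correlation matrix is our assumption about the convention, and K = rank(X) recovers the pure column-space projection:

```python
import numpy as np

def regularize_omega(omega, X, K):
    """Restrict Omega to the K leading eigendirections of the data.

    X: (P, N) matrix of the P training samples; omega: (M, N) mapping."""
    C = X.T @ X / len(X)                       # data correlation matrix (N x N)
    eigval, eigvec = np.linalg.eigh(C)         # ascending eigenvalues
    V = eigvec[:, ::-1][:, :K]                 # K leading eigenvectors, shape (N, K)
    return omega @ (V @ V.T)                   # projected mapping Omega_hat
```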
posterior regularization

the regularized mapping can be obtained in two ways:

pre-processing of the data (PCA-like):
- mapped feature space, fixed K
- prototypes yet unknown (*)

after/during training:
- retains original features, flexible K
- can include the prototypes

here: posterior regularization in classification schemes
- dependence of generalization performance on the parameter K
- improved interpretability of the mapping / distance measure

(*) remark: prototypes are (close to) linear combinations of feature vectors when converged
illustrative example

infra-red spectral data: 124 wine samples, 256 wavelengths;
30 training data, 94 test spectra

GMLVQ classification (here: of the binned alcohol content)

[figure: spectra, color-coded by binned alcohol content]

high correlation of features (neighboring channels) and P = 30 →
an effective dimension ≪ 256 can be expected
illustrative example

[figure: test error vs. K; best performance with 7 dimensions remaining; over-fitting effect for larger K; null-space correction leaves P = 30 dimensions]

regularization (beyond column space projection)
- potentially enhances generalization, controls over-fitting
regularization
- enhances generalization
- smoothens relevance profile/matrix
- removes ‘false relevances’
- improves interpretability of Λ

[figure: relevance profile/matrix before and after regularization]
Questions?