Practical Tools for Self-Organizing Maps
Lutz Hamel, Dept. of Computer Science & Statistics, URI
Chris Brown, Dept. of Chemistry, URI
Self-Organizing Maps (SOMs)
A neural network approach to unsupervised machine learning.
Appealing visual presentation of learned results as a 2D map.
T. Kohonen, Self-Organizing Maps, 3rd ed. Berlin; New York: Springer, 2001.
Self-Organization and Learning
Self-organization refers to a process in which the internal organization of a system increases automatically without being guided or managed by an outside source.
This process is due to local interaction with simple rules.
Local interaction gives rise to global structure.
R. Lewin, Complexity: Life at the Edge of Chaos, 2nd ed. University of Chicago Press, 2000.
We can interpret emerging global structures as learned structures.
Learned structures appear as clusters of similar objects.
Self-Organizing Maps
Data Table → “Grid of Neurons” → Visualization
Algorithm:
  Repeat until done:
    For each row in the data table:
      Find the neuron that best describes the row.
      Make that neuron look more like the row.
      Smooth the immediate neighborhood of that neuron.
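The training loop above can be sketched in a few lines of Python; the grid size, learning rate, neighborhood width, and epoch count below are illustrative choices, not values from the talk.

```python
import numpy as np

def train_som(data, grid_w=5, grid_h=5, epochs=50, lr=0.5, sigma=1.5, seed=0):
    """Minimal SOM training loop following the pseudocode above."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # One weight vector ("neuron") per grid node, randomly initialized.
    weights = rng.random((grid_h, grid_w, n_features))
    # Grid coordinates, used by the neighborhood function.
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    for _ in range(epochs):
        for row in data:
            # Find the neuron that best describes the row (best matching unit).
            dists = np.linalg.norm(weights - row, axis=2)
            by, bx = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood: smooth the BMU's immediate neighborhood.
            grid_dist2 = (gy - by) ** 2 + (gx - bx) ** 2
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))
            # Make the neurons look more like the row, weighted by neighborhood.
            weights += lr * h[:, :, None] * (row - weights)
    return weights
```

In practice the learning rate and neighborhood width are usually decayed over epochs; they are held constant here to keep the sketch short.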
Feature Vector Construction
In order to use SOMs we need to describe our objects as feature vectors:
  small medium big two-legs four-legs hair hooves mane feathers hunt run fly swim
    1     0     0     1        0       0     0     0      1      0    0   0    1
    0     0     1     0        1       1     1     0      0      0    1   0    0
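As a sketch, this encoding can be automated. The feature list follows the table above; the animal names are illustrative guesses (the original slides show pictures), not part of the source.

```python
# Feature order taken from the table above.
FEATURES = ["small", "medium", "big", "two legs", "four legs", "hair",
            "hooves", "mane", "feathers", "hunt", "run", "fly", "swim"]

def encode(traits):
    """Turn a set of trait names into a 0/1 feature vector."""
    return [1 if f in traits else 0 for f in FEATURES]

# Hypothetical objects matching the two table rows.
duck = encode({"small", "two legs", "feathers", "swim"})
horse = encode({"big", "four legs", "hair", "hooves", "run"})
```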
Training a SOM
Table of Feature Vectors → “Grid of Neurons” → Visualization
Applications of SOM
Infrared Spectroscopy
Goal: to find out if compounds are chemically related without performing an expensive chemical analysis.
Each compound is tested for light absorbency in the infrared spectrum.
Specific chemical structures absorb specific ranges in the infrared spectrum.
This means each compound has a specific “spectral signature”.
We can use SOMs to investigate similarity.
Training SOM with Spectra
[Figure: a grid of neurons initialized with random-number spectra is trained against a spectral library (wavenumbers roughly 3000–1000 cm-1).]
Self-Organizing Map: MIR Spectra
MIR SOM: Functional Groups
[Map regions labeled by functional group: alkanes, alcohols, acids, ketones/aldehydes, esters, aromatics, and clusters marked CxHy-NH2, NH2-, Clx-, Ester-, and CxHy-.]
MIR Centroid Spectra
[Figure: centroid spectra, 4000–500 cm-1.]
MIR Significance Spectrum
[Figure: significance vs. wavenumber, 4000–1000 cm-1, with peaks annotated -OH, -CH3, -CH2-, -C=O, -C-O-, and N-H.]
NIR SOM
[Map regions labeled: aromatics, alcohols, carbonyls, acids, alkanes, amines.]
NIR SOM: Functional Groups
[Map regions with cluster labels NH2-, Clx-, Ester-, CH3-, Poly-, and CxHy-.]
NIR Centroid Spectra
[Figure: centroid spectra, 4000–7000 cm-1.]
NIR Significance Spectrum
[Figure: significance vs. wavenumber, 4000–7000 cm-1, with peaks annotated -OH, -CH3, -CH2-, and N-H.]
Significance Spectrum
SOM: Bacterium b-cereus on different agars
[Map regions labeled by growth medium: mannitol, nutrient, blood, and chocolate blood agars.]
“You are what you eat!”
Significance Spectrum: b-cereus on different agars
[Figure: absorbance and significance vs. wavenumber, 2000–800 cm-1, for chocolate blood, nutrient, mannitol, and blood agars.]
SOM: Bacteria Spectra
[Map separating spores from vegetative cells for b-subtilis, b-cereus, b-anthracis, and b-thur.]
Significance Spectrum vs. b-subtilis 1st Derivative Spectra
[Figure: absorbance and significance vs. wavenumber, 2000–800 cm-1, for vegetative cells and spores.]
Applications of SOM
Genome Clustering
Goal: trying to understand the phylogenetic relationship between different genomes.
Compute bootstrap support of individual genomes for different phylogenetic tree topologies, then cluster based on the topology support.
Joint work with Prof. Gogarten, Dept. of Molecular Biology, Univ. of Connecticut.
Phylogenetic Visualization with SOMs
[Figure: SOM map with genome-cluster counts.]
Applications of SOM
Clustering proteins based on the architecture of their activation loops:
  Align the proteins under investigation.
  Extract the functional centers.
  Turn the 3D representation into 1D feature vectors.
  Cluster based on the feature vectors.
Joint work with Dr. Gongqin Sun, Dept. of Cell and Molecular Biology, URI
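One plausible way to turn a 3D functional center into a 1D feature vector is a distance histogram around the reference point. This is an illustrative encoding under our own assumptions, not necessarily the authors' exact representation; the function name and parameters are hypothetical.

```python
import numpy as np

def loop_feature_vector(coords, center, radius=10.0, n_bins=20):
    """Sketch: histogram the distances of atoms within `radius` of a
    reference point (e.g. the p-loop center) to get a fixed-length
    1D feature vector from a 3D structure."""
    d = np.linalg.norm(coords - center, axis=1)
    d = d[d <= radius]                      # keep atoms inside the sphere
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, radius))
    # Normalize so proteins of different sizes remain comparable.
    return hist / max(hist.sum(), 1)
```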
Structural Classification of GTPases
Can we structurally distinguish between the Ras and Rho subfamilies?
Ras: 121P, 1CTQ, and 1QRA. Rho: 1A2B and 1OW3. F = p-loop, r = 10 Å.
[Figure: SOM map separating the Ras (e.g. 1CTQ) and Rho (e.g. 1A2B) structures.]
Two Central Questions
Which features are the most important ones for clustering?
How good is the map?
Variance Matters!
Features with large variance have a higher probability of showing structure than features with small variance.
Therefore, features with large variance tend to be more significant to the clustering process than features with small variance.
[Figure: toy data sets contrasting small variance with large variance.]
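Ranking features by variance can be illustrated numerically; the toy data below is ours, not from the talk.

```python
import numpy as np

# Two features: the first is nearly constant (small variance), the
# second spreads out (large variance) and is the better cluster candidate.
data = np.array([[0.1, 5.0],
                 [0.1, 1.0],
                 [0.1, 9.0]])
variances = data.var(axis=0)
ranked = np.argsort(variances)[::-1]  # most promising feature first
```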
Bayes Theorem
Using Bayes' theorem we turn the observed variances (observed significances) into significance probabilities:
  P(A_i | +) ≡ observed significance of feature A_i
  P(+ | A_i) ≡ probability that feature A_i is significant (its significance)
  P(A_i) ≡ prior of feature A_i
  P(+ | A_k) = P(A_k | +) P(A_k) / Σ_i P(A_i | +) P(A_i) ≡ significance of feature A_k
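This Bayes step can be sketched in a few lines, assuming the normalized feature variances play the role of the observed significances P(A_i | +) and uniform priors are used unless others are supplied; the function name is our own.

```python
import numpy as np

def significance_probabilities(variances, priors=None):
    """Convert observed per-feature variances into significance
    probabilities P(+ | A_k) via the Bayes formula above."""
    obs = np.asarray(variances, dtype=float)
    obs = obs / obs.sum()                            # P(A_i | +)
    if priors is None:
        priors = np.full(len(obs), 1.0 / len(obs))   # uniform P(A_i)
    post = obs * priors                              # numerator terms
    return post / post.sum()                         # normalize over features
```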
Probabilistic Feature Selection
Given the significance probability of each feature of a data set we can ask interesting questions:
  How many features do we need to include in our training data in order to be 95% sure that we included all the significant features?
  What is the significance level of the top 10% of my features?
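Both questions can be answered from the sorted significance probabilities; the helper names below are our own, and the cumulative-probability reading of the questions is an assumption.

```python
import numpy as np

def features_for_coverage(sig_probs, coverage=0.95):
    """How many top-ranked features are needed before their cumulative
    significance probability reaches `coverage`?"""
    p = np.sort(np.asarray(sig_probs, dtype=float))[::-1]
    return int(np.searchsorted(np.cumsum(p), coverage) + 1)

def top_fraction_significance(sig_probs, fraction=0.10):
    """What share of the total significance probability is carried by
    the top `fraction` of the features?"""
    p = np.sort(np.asarray(sig_probs, dtype=float))[::-1]
    k = max(1, int(round(fraction * len(p))))
    return float(p[:k].sum())
```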
Feature Selection
[Figure: significance plot and probability distribution. At a significance of 95%, how many features? For the top 10% of features, what significance?]
Evaluating a SOM
The canonical performance metric for SOMs is the quantization error.
It is very difficult to relate to the training data (e.g., how small is the optimal quantization error?).
Here we take a different approach: we view a map as a non-parametric, generative model.
This gives rise to a new model evaluation criterion via the classical two-sample problem.
Generative Models
A generative model is a model that we can sample to compute new values of the underlying input domain.
The classical generative model is the Gaussian function: once we have fitted the function through our known samples, we can compute the probability of any sample of the input domain.
However, the model is parametric; it is governed by the mean µ and the standard deviation σ.
Notation: N(µ, σ²), ∀x
Image source: www.wikipedia.com
Insight: SOMs Sample the Data Space
Given some distribution in the data space, SOM will try to construct a sample that looks like it was drawn from the same distribution.
It will construct the sample using interpolation (the neighborhood function) and constraints (the map grid).
We can then measure the quality of the map using a statistical two-sample approach.
Algorithm:
  Repeat until done:
    For each row in the data table:
      Find the neuron that best describes the row.
      Make that neuron look more like the row.
      Smooth the immediate neighborhood of that neuron.
Image source: www.peltarion.com
SOM as a Non-parametric Generative Model
Let D be our training set drawn from a distribution N(µ, σ²); then N(µ_D, σ_D²) is a good approximation to the original distribution if D is large enough:
  N(µ, σ²) ≈ N(µ_D, σ_D²)
Now, let M be the set of samples SOM constructs at its map grid nodes. We say that SOM is converged if the mean µ_M and the variance σ_M² of the model samples appear to be drawn from the same underlying distribution N(µ, σ²) as the training data:
  N(µ, σ²) ≈ N(µ_M, σ_M²)
SOM as a Non-parametric Generative Model
Now, the distribution N(µ, σ²) is unknown, but we have a good approximation to it in our training set D, namely N(µ_D, σ_D²). Therefore, in order to test for convergence we have to show that
  N(µ_M, σ_M²) ≈ N(µ_D, σ_D²),
or, "we test that the model samples and the training samples were drawn from the same distribution". This is an application of the classical statistical two-sample test: we use Student's t-test to test that the means µ_M and µ_D are due to the same distribution, and the F-test to show that the variances σ_M² and σ_D² are due to the same distribution.
SOM as a Non-parametric Generative Model
Observations:
  The SOM model is non-parametric (or distribution free), since there are no distribution parameters to fit.
  We can sample from a SOM model using linear interpolation on the node grid.
  A converged model is a good-fitting model; it models the underlying distribution very well.
Conclusions
SOMs have a wide range of applications.
We have developed two statistical tools that allow us to evaluate SOMs very effectively:
  Probabilistic feature selection
  Goodness of fit
In the future we need to address the reliance on normal distributions in our tests, e.g. via resampling techniques such as the bootstrap.