Clustering of Biomedical Data
Richard Röttger
CBioVikings, Copenhagen
February 2nd, 2017
Who is talking?
Resources
Go to http://imada.sdu.dk/~roettger/teaching/cbiovikings.php
You will find:
• The dataset
• These slides
• An overview paper
• A small R script for a cluster analysis
• R tutorial
R. Röttger. Clustering of Biological Datasets in the Era of Big Data. Journal of Integrative Bioinformatics 13 (1), 300
Clustering in Life Sciences
Long-standing problem in computer science: “grouping or segmenting a collection of objects into subsets or ‘clusters’ such that those within each cluster are more closely related to one another than objects assigned to different clusters.”
Applied in almost every scientific field, e.g.:
• Information retrieval
• Economics and marketing
• Astronomy
• Bioinformatics
In bioinformatics:
• Homology detection
• Gene expression studies
• Protein complex prediction
Image taken from: http://clusteval.sdu.dk/
Complexity of Clustering
Most Pressing Issues:
• What tool to use?
• How to find a best clustering?
• How to tune the parameters of a tool?
• How to do this in a reliable and reproducible manner?
Graphical Analysis
First, let’s have a look!
Good way to gain an overview
Histograms and Scatterplots
Can be misleading
Hard to automatize
Scatterplots
How many clusters do you see?
This is so-called overplotting.
Only meaningful for bivariate data
Density Estimation
We have seen that we have a couple of problems:
• Overplotting
• A wrong bin size can easily hide interesting features
Now, let’s consider a different approach:
• Assume that our dataset originates from some probability density function
• If we knew the type and specifics of this density function, we would have all the information we need for a clustering
• BUT: We do not have this information!
• And we do not want to make any assumptions (this is the so-called non-parametric density estimation)
Histogram as a Density Estimate
Divide the sample space into a number of bins
Approximate the density at the center of each bin by counting
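The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming one-dimensional data and equally wide bins; the function name `histogram_density`, the parameters `origin` and `bin_width`, and the sample values are made up for this example. The density in a bin is taken as the fraction of samples in that bin divided by the bin width.

```python
import math

def histogram_density(samples, x, origin, bin_width):
    """Histogram density estimate at x (hypothetical helper for illustration)."""
    # Bins are the intervals [origin + k*w, origin + (k+1)*w); find the bin of x.
    bin_index = math.floor((x - origin) / bin_width)
    # Count the samples that fall into the same bin as x.
    count = sum(1 for s in samples
                if math.floor((s - origin) / bin_width) == bin_index)
    # Fraction of samples in the bin, divided by the bin width.
    return count / (len(samples) * bin_width)

samples = [0.1, 0.2, 0.25, 0.8, 1.1, 1.3, 1.35, 1.4]

# Same data, same bin width, but a shifted starting position of the bins.
density_a = histogram_density(samples, 0.2, origin=0.0, bin_width=0.5)
density_b = histogram_density(samples, 0.2, origin=-0.25, bin_width=0.5)
print(density_a, density_b)
```

The two calls differ only in where the bins start, yet they return different estimates at the same point, which illustrates why the bin placement matters.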
Drawbacks of a Histogram
The density estimate depends on the starting position of the bins
Discontinuities of the estimate are not due to the underlying density; they are an artifact of the chosen bin locations
Curse of dimensionality: the number of bins grows exponentially with the number of dimensions
• In high dimensions, many examples are needed in order to have non-empty bins
Therefore:
• Unsuitable for high dimensions
• More sophisticated density estimators required
Kernel Density Estimators; Parzen Windows
We can estimate a density function by employing a kernel function K:
Notice how the Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data
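A minimal sketch of such an estimator in one dimension follows. It assumes the common Parzen window form p(x) = 1/(n·h) · Σ K((x − x_i)/h) with a Gaussian kernel K; the bandwidth `h = 0.3`, the sample values, and the function names are illustrative assumptions, not taken from the slides.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_density(samples, x, bandwidth):
    """Kernel density estimate at x: one kernel is centered on every sample."""
    n = len(samples)
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (n * bandwidth)

samples = [0.1, 0.2, 0.25, 0.8, 1.1, 1.3, 1.35, 1.4]

# Evaluate the estimate on a grid from -2 to 4 and check that it integrates to ~1.
step = 0.05
grid = [i * step for i in range(-40, 81)]
area = sum(parzen_density(samples, g, 0.3) for g in grid) * step
print(round(area, 4))
```

Unlike the histogram, no bin origin has to be chosen; the remaining free parameter is the bandwidth, which plays a role similar to the bin width.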
Different Kernel Functions
Revisiting our Example
Pre-Processing
Preprocessing: Feature Extraction / Selection
Observation:
• Features might be correlated
• Features might be useless for a clustering
• Features might even blur the cluster structure
Feature Selection:
• Utilizes only a subset of the available features
• Most methods are coupled with a mining tool to determine optimality
Feature Extraction:
• Creates new features out of the existing features
• Seeks to create uncorrelated, better features
• Examples: PCA, PCoA
Preprocessing: Normalization
Feature 1: [0, 1]; Feature 2: [1000, 80000]
Normalization: Bring both features to [0, 1], $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
=> Bad with outliers!
Standardization: The values are centered at the mean and scaled by the standard deviation, $z = \frac{x - \mu}{\sigma}$
Generally: Loss of scale and location!
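Both rescalings can be sketched as follows. This assumes min-max scaling for the normalization and a z-score with the population standard deviation for the standardization; the function names and the values of `feature2` are invented for illustration.

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical feature with a single large outlier.
feature2 = [1000.0, 1500.0, 2000.0, 2500.0, 80000.0]

normalized = min_max_normalize(feature2)
standardized = standardize(feature2)
print(normalized)
print(standardized)
```

In the normalized output, the outlier occupies the value 1 while all other values are squeezed into a tiny interval near 0, which is the problem with outliers noted above.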
PCA
PCA is a very complex and large topic which can easily fill an entire lecture series
Furthermore, there are many interpretations and different applications of PCA [1]
Here, we limit ourselves to the usage of PCA in clustering:
• Project the data to a lower-dimensional space
• With the intention of simplifying the clustering
• Hopefully provides a better means for visual inspection
The task of a PCA is to perform a dimensionality reduction in such a way that most of the variance in the original data is preserved
[1] See for example: https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/
An Example
How does a PCA work?
The PCA performs a basis transformation in which the first basis vector accounts for most of the variance in the dataset, the second for most of the remaining variance, and so on.
These basis vectors can be found by the eigenvalue decomposition of the covariance matrix $Q$ or of the sample correlation matrix $R$.
The eigenvalues $\lambda_1, \dots, \lambda_d$ indicate the variance along the corresponding eigenvectors $y_1, \dots, y_d$
The Co-Variance Matrix
The co-variance is defined as
This is the observed covariance for 𝑛 observations 𝑥𝑖 , 𝑦𝑖
The co-variance matrix is then defined as
The covariance matrix generalizes the notion of variance to multiple dimensions
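The procedure (covariance matrix first, then its leading eigenvector) can be sketched without external libraries. This assumes the sample covariance with denominator n − 1 and uses power iteration as a simple stand-in for a full eigenvalue decomposition; it only recovers the first principal component. The data in `rows` and all function names are illustrative.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (denominator n - 1) and the mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and re-normalize.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue, i.e. the variance along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
cov, centered = covariance_matrix(rows)
eigenvalue, direction = first_principal_component(cov)

# Project the centered data onto the first principal component (2D -> 1D).
projected = [sum(c[j] * direction[j] for j in range(len(direction))) for c in centered]
explained = eigenvalue / (cov[0][0] + cov[1][1])
print(direction, explained)
```

Here `explained` is the share of the total variance (the trace of the covariance matrix) that survives the projection onto one dimension.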
Example: PCA
Image taken from Ricardo Gutierrez-Osuna’s class on Pattern Analysis
Proximity Calculation
Different Proximity Measures
Similarity
• Numerical measure of how alike two data objects are
• Is higher when objects are more alike
• Often falls in the range [0, 1]
Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Often called distance if it fulfills the metric properties
One-mode / Two-mode
One-mode
• In a one-mode dataset, the data is given as an $n \times n$ matrix $P = (p_{ij})$
• $p_{ij}$ relates each pair of objects $x_i$ and $x_j$ with each other
• Also often called a similarity/dissimilarity matrix
• Normally, a one-mode matrix is symmetric
• Called one-mode as columns and rows describe the same thing
Two-mode
• A two-mode dataset normally comes as an $n \times d$ matrix
• Each object is in a row, with each property being stored in a different column
• Sometimes, this mode is called the raw data
• A row is also sometimes called a feature vector
Proximity Calculation: Continuous Data
Euclidean-type measures (Minkowski distance)
Image taken from wikipedia.com
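As a short sketch, the Minkowski distance of order p between two feature vectors is (Σ |x_j − y_j|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance, while the Chebyshev distance is the limit for p → ∞. The function names and example vectors are chosen for illustration.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equally long feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev_distance(x, y):
    """Limit of the Minkowski distance for p -> infinity: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski_distance(x, y, 1)
euclidean = minkowski_distance(x, y, 2)
chebyshev = chebyshev_distance(x, y)
print(manhattan, euclidean, chebyshev)
```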
Correlation Coefficient
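A minimal sketch, assuming the Pearson correlation coefficient is meant here: it compares the shape of two profiles and ignores their scale and offset. Turning it into a dissimilarity via 1 − r is one common choice and is an assumption of this example, as are the profile values.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

def correlation_distance(x, y):
    """One common way to turn the correlation into a dissimilarity (assumed here)."""
    return 1.0 - pearson_correlation(x, y)

profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]   # same shape as profile_a, different scale
profile_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend

r_ab = pearson_correlation(profile_a, profile_b)
r_ac = pearson_correlation(profile_a, profile_c)
print(r_ab, r_ac, correlation_distance(profile_a, profile_b))
```

Note that `profile_a` and `profile_b` are far apart in Euclidean terms but perfectly correlated, which is why correlation is often preferred when only the shape of a profile matters.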
Similarity Measures for Binary Variables
Most of the measures define a similarity based on the counts of matches and mismatches between two objects over the 𝑑 variables
Generally speaking, the counts 𝑎 (both objects have the feature) and 𝑑 (neither object has it) can be seen as matches, the counts 𝑏 and 𝑐 (only one of the objects has it) as mismatches
While 𝑏 and 𝑐 can be seen as equivalent, this is certainly not true for the matching states 𝑎 and 𝑑
Matching coefficient: $s = \frac{a + d}{a + b + c + d}$
Jaccard coefficient: $s = \frac{a}{a + b + c}$
When the presence of a feature has the same “explanatory power” as its absence, the matching coefficient is applied; otherwise, the Jaccard coefficient is used
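The difference between the two coefficients can be sketched as follows, assuming the usual convention that a counts positions where both objects are 1, d positions where both are 0, and b and c the two kinds of mismatch. The vectors and function names are made up for illustration.

```python
def binary_counts(x, y):
    """Return the counts (a, b, c, d) for two equally long 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # positive matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # mismatch: only x
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # mismatch: only y
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # negative matches
    return a, b, c, d

def matching_coefficient(x, y):
    """Counts positive and negative matches alike."""
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(x, y):
    """Ignores negative matches; undefined if neither object has any feature."""
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)

x = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0]

matching = matching_coefficient(x, y)
jaccard = jaccard_coefficient(x, y)
print(matching, jaccard)
```

With many shared absences, the matching coefficient reports a high similarity, while the Jaccard coefficient only looks at the features that are present in at least one object.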
Similarity Measures for Categorical Data
A straightforward way would be to treat each level of the categorical variable as its own binary variable and apply the known measures
• Let’s say the variable eye-color ∈ {blue, brown, green, gray}
• It can be converted into the binary variables “has blue eyes”, “has brown eyes”, ...
Problem: By default, many negative matches
Therefore: Often, one counts on how many of the variables two objects agree
Proximity Calculation: Specialized Functions
These standard methods are often not sufficient for biological data, as we have neither categorical data nor an embedding in an n-dimensional space
How to embed a sequence? A network? A protein structure?
Specialized measures:
• BLAST
• Network edit distance
• Protein structure alignments
Clustering
From a Criterion to an Algorithm
Each clustering tool optimizes some inherent idea of a perfect clustering
They are all only approximations!
Possibilities to separate $n$ objects into $k$ clusters:
$N(5, 2) = 15$
$N(10, 3) = 9330$
$N(50, 4) \approx 5.3 \cdot 10^{28}$
$N(100, 5) \approx 6.6 \cdot 10^{67}$
There are an estimated $4 \cdot 10^{80 \pm 1}$ atoms in the observable universe
It is important to know what exactly the clustering algorithm optimizes!
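The numbers above are consistent with the Stirling numbers of the second kind, which count the ways to partition n objects into k non-empty clusters. A small sketch using the standard recurrence S(n, k) = k·S(n − 1, k) + S(n − 1, k − 1); the function name `stirling2` is chosen for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty clusters."""
    if n == k:
        return 1          # every object is its own cluster (also covers n = k = 0)
    if k == 0 or k > n:
        return 0          # no valid partition
    # The n-th object either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

for n, k in [(5, 2), (10, 3), (50, 4), (100, 5)]:
    print(n, k, stirling2(n, k))
```

The rapid growth is the reason why exhaustive search over all clusterings is infeasible and every tool has to rely on a heuristic.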
Tool Selection: Overview
k-Means based
Hierarchical
Graph-based
Density-based
Others
Tool Selection: k-Means based
Most popular clustering tool
Two-step iterative process:
• Assign objects to the closest centers
• Update these centers
Good time complexity (almost linear)
Minimizes the mean squared error of the objects to the cluster centers
Works quite well in practice
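The two-step iteration can be sketched as below. This is a bare-bones version of Lloyd's algorithm, assuming random initial centers drawn from the data and a fixed seed; the function names, the toy points, and the stopping rule (no change in the assignment) are choices made for this illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=100, seed=0):
    """Minimal k-means: alternate between assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # k distinct data points as initial centers
    assignment = None
    for _ in range(iterations):
        # Step 1: assign every object to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                        # converged: nothing changed
        assignment = new_assignment
        # Step 2: move every center to the mean of its assigned objects.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
assignment, centers = k_means(points, k=2)
print(assignment)
print(centers)
```

Each iteration touches every object once per center, which is where the almost linear running time comes from.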
Tool Selection: Problems with k-means
Sensitive to initialization: how do we choose the initial partitions?
• Run the algorithm several times with different initializations
• (Subset) furthest-first initialization
k-Means prefers hyperspherical clusters of approximately the same size
Image taken from wikipedia.com
How to find the best 𝑘?
No easy answer to that
Employ domain knowledge
Use internal cluster validity indices
Use GAP statistic
Tool Selection: Hierarchical
Creates a hierarchical embedding of the clustering
Two main branches:
• Agglomerative
• Divisive
Image: Brazma, Alvis, and Jaak Vilo. "Gene expression data analysis." FEBS letters 480.1 (2000): 17-24.
Tool Selection: Single Linkage
The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
Tool Selection: Complete Linkage
The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters
Tool Selection: Average Linkage
The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters
Compromise between single and complete linkage
Strengths: Less susceptible to noise and outliers
Limitations: Biased towards spherical clusters
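The three linkage definitions differ only in how the pairwise distances between two clusters are summarized. A small sketch, assuming Euclidean distance between objects; the clusters and function names are illustrative.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise_distances(cluster_a, cluster_b):
    """Distances of all pairs with one object from each cluster."""
    return [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))    # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))    # farthest pair

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)                  # mean over all pairs

cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, complete, average)
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest such distance; only the chosen linkage function changes between the variants.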
Tool Selection: Overview
k-Means based
Hierarchical
Graph-based
• Represent the data as a graph
• Identify densely connected areas in the graph
• Examples: MCL, Transitivity Clustering, Affinity Propagation
• Used for: network and complex analysis
Density-based
…
Images: Vlasblom, James, and Shoshana J. Wodak. "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs." BMC bioinformatics 10.1 (2009): 1.
Tool Selection: Overview
k-Means based
Hierarchical
Graph-based
Density-based
• Separate high-density areas from low-density areas
• Very efficient
• Arbitrary cluster shapes
• Require an embedding of the objects
…
Cluster Evaluation
Evaluate a Clustering
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
No puppets were harmed in the production of this lecture; generally, the usage of black magic is limited to a minimum at SDU.
Overview of Cluster Validation
Two different kinds of measures can be distinguished
External Measures
• Compare two clusterings
• Use a gold standard to evaluate the quality of a clustering
Internal Measures
• Only use the clustering as basis for evaluation
• Comparable to cluster criteria
External Measures
We can look at each pair of points and define
Or map each cluster $c_j$ to the gold standard cluster $k_i$ with the highest overlap:
• TP if $a \in k_i \land a \in c_j$
• FP if $a \notin k_i \land a \in c_j$
• FN if $a \in k_i \land a \notin c_j$
Now we can define Measures
Rand Index (pair-wise)
Jaccard Index (pair-wise)
F-measure (mapping)
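A sketch of the three measures under common definitions, which are assumptions of this example: for the pair-wise measures, a pair of objects is a TP if it shares a cluster in both the clustering and the gold standard, an FP if only in the clustering, an FN if only in the gold standard, and a TN otherwise. The Rand index is (TP + TN) / all pairs, the Jaccard index is TP / (TP + FP + FN), and the F-measure is the harmonic mean of precision and recall for a mapped pair of clusters. Labels and function names are illustrative.

```python
from itertools import combinations

def pair_counts(clustering, gold):
    """Count TP, FP, FN, TN over all pairs of objects (labels given per object)."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_in_clustering = clustering[i] == clustering[j]
        same_in_gold = gold[i] == gold[j]
        if same_in_clustering and same_in_gold:
            tp += 1
        elif same_in_clustering:
            fp += 1
        elif same_in_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(clustering, gold):
    tp, fp, fn, tn = pair_counts(clustering, gold)
    return (tp + tn) / (tp + fp + fn + tn)

def jaccard_index(clustering, gold):
    tp, fp, fn, _ = pair_counts(clustering, gold)
    return tp / (tp + fp + fn)

def f_measure(cluster, gold_cluster):
    """F-measure of one cluster (set of objects) against its mapped gold cluster."""
    tp = len(cluster & gold_cluster)
    if tp == 0:
        return 0.0
    precision = tp / len(cluster)        # TP / (TP + FP)
    recall = tp / len(gold_cluster)      # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

gold = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]          # object 2 ended up in the wrong cluster

rand = rand_index(clustering, gold)
jaccard = jaccard_index(clustering, gold)
f_score = f_measure({2, 3, 4, 5}, {3, 4, 5})
print(rand, jaccard, f_score)
```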
Internal Measures
No additional information about the ground truth is available
Similar to cluster criteria
Normally based on:
• Compactness: measures how closely related the objects in a cluster are
• Separation: measures how distinct or well-separated a cluster is from other clusters
Dunn Index
• The Dunn Index assesses the clustering performance by relating the maximal cluster diameter to the minimal distance between clusters
• This measure is prone to outliers, as it is based on minimal and maximal distances
Davies-Bouldin Index
• The Davies-Bouldin Index DB is defined based on the average distances between objects and their cluster centroids
Silhouette Coefficient
Based on:
• Cohesion $a(x)$: average within-cluster distance of $x$
• Separation $b(x)$: average distance of $x$ to the closest other cluster
Takes values between -1 and 1
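A minimal sketch, assuming the usual combination s(x) = (b(x) − a(x)) / max(a(x), b(x)) and Euclidean distance. Returning 0 for an object that is alone in its cluster, or when there is only one cluster, is a convention assumed here. The toy points and function names are illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette(points, labels, index):
    """Silhouette value of the object at position index."""
    x, own = points[index], labels[index]
    # Collect the distances from x to all other objects, grouped by cluster label.
    distances = {}
    for i, (p, label) in enumerate(zip(points, labels)):
        if i != index:
            distances.setdefault(label, []).append(euclidean(x, p))
    if own not in distances or len(distances) < 2:
        return 0.0                                   # singleton or single cluster
    a = sum(distances[own]) / len(distances[own])    # cohesion a(x)
    b = min(sum(d) / len(d)                          # separation b(x)
            for label, d in distances.items() if label != own)
    if max(a, b) == 0.0:
        return 0.0                                   # all relevant points coincide
    return (b - a) / max(a, b)

points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]

s0 = silhouette(points, labels, 0)
mean_silhouette = sum(silhouette(points, labels, i)
                      for i in range(len(points))) / len(points)
print(s0, mean_silhouette)
```

Values close to 1 indicate that an object sits well inside its own cluster; averaging over all objects gives one score for the whole clustering.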
What to do?
Such a method does not exist for all use cases
• How to proceed then?
• Is there a general rule we could follow?
ClustEval: Fully automatizes the clustering
• We tested 13 clustering methods
• On 24 datasets (12 real-world, 12 artificial)
• 13 common validity measures
• 1000 parameter sets per tool per dataset
Results of ClustEval
http://clusteval.sdu.dk/
There is no general best performer among the tools
Quite often internal and external measures do not agree on the performance assessment
When using only biomedical datasets, the Silhouette Value has the best agreement with external measures
Workshop Introduction
BreathOMICS data
What is it good for?
Figure: averaged signal over time (time scale individually normalized, a.u.) for the groups SHAM S and CLI S; labelled peaks: 3-pentanone monomer & dimer. © J.I. Baumbach - B&S Analytik, Dortmund, Germany
Data Preprocessing
RAW Smoothed De-noised
Peak Detection
- Local maxima search (LMS)
- Merged peak cluster localization (MPCL) Bader et al. 2008
- Wavelet-based multi-scale peak detection Bader et al. 2008
- Water shed transformation (WST) Bunkowski et al. 2011
- Peak model estimation (PME) Kopczynski et al. 2012
BreathOMICS
Figure axes: Substances, Patients
Peak Alignment