CBioVikings - SDUroettger/teaching/resources/cbio... · 2019. 10. 20. · 2017 Clustering of Biomedical Data Richard Röttger 1 CBioVikings Copenhagen February 2nd, 2017

Clustering of Biomedical Data

Richard Röttger

CBioVikings, Copenhagen

February 2nd, 2017


Who is talking?



Resources

Go to http://imada.sdu.dk/~roettger/teaching/cbiovikings.php

You will find:
• The dataset
• These slides
• An overview paper
• A small R script for a cluster analysis
• An R tutorial

R. Röttger. Clustering of Biological Datasets in the Era of Big Data. Journal of Integrative Bioinformatics 13 (1), 300.



Clustering in Life Sciences

Long-standing problem in computer science: “grouping or segmenting a collection of objects into subsets or ‘clusters’, such that those within each cluster are more closely related to one another than objects assigned to different clusters.”

Image taken from: http://clusteval.sdu.dk/






Applied in almost every scientific field, e.g.:
• Information retrieval
• Economics and marketing
• Astronomy
• Bioinformatics

In bioinformatics:
• Homology detection
• Gene expression studies
• Protein complex prediction




Complexity of Clustering

Most Pressing Issues:

• What tool to use?
• How to find the best clustering?
• How to tune the parameters of a tool?
• How to do all this in a reliable and reproducible manner?


Graphical Analysis



First, let’s have a look!

• Good way to gain an overview
• Histograms and scatterplots
• Can be misleading
• Hard to automate


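Scatterplots

How many clusters do you see?

When many points are drawn on top of each other, dense regions can no longer be told apart. This is so-called overplotting.

Scatterplots are only meaningful for bivariate data.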


Density Estimation

We have seen that we have a couple of problems:
• Overplotting
• A wrong bin size can easily hide interesting features

Now, let’s consider a different approach:
• Assume that our dataset originates from some probability density function
• If we knew the type and specifics of this density function, we would have all the information we need for a clustering
• BUT: We do not have this information!
• And we do not want to make any assumptions about its form (this is the so-called non-parametric density estimation)


Histogram as a Density Estimate


Divide the sample space into a number of bins

Approximate the density at the center of each bin by counting
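As a minimal sketch of this counting estimate: inside a bin, the density is approximated by (number of samples in the bin) / (n · bin width). The function name `histogram_density`, the parameters `origin` and `width`, and the sample values are illustrative choices, not taken from the slides.

```python
import math

def histogram_density(samples, origin, width):
    """Return p(x), a piecewise-constant density estimate built from bin counts."""
    n = len(samples)
    counts = {}
    for s in samples:
        b = math.floor((s - origin) / width)  # index of the bin containing s
        counts[b] = counts.get(b, 0) + 1

    def p(x):
        b = math.floor((x - origin) / width)
        # fraction of samples in the bin, divided by the bin width
        return counts.get(b, 0) / (n * width)

    return p

# Hypothetical one-dimensional sample
samples = [0.1, 0.2, 0.25, 0.9, 1.1, 1.15, 1.2, 2.6]
p = histogram_density(samples, origin=0.0, width=0.5)
```

With these eight samples and a bin width of 0.5, three samples fall into the bin [0, 0.5), so `p(0.3)` evaluates to 3 / (8 · 0.5) = 0.75.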


Drawbacks of a Histogram

• The density estimate depends on the starting position of the bins
• The discontinuities of the estimate are not due to the underlying density; they are an artifact of the chosen bin locations
• Curse of dimensionality: the number of bins grows exponentially with the number of dimensions
  • In high dimensions, many examples are needed in order to have non-empty bins

Therefore:
• Unsuitable for high dimensions
• More sophisticated density estimators required


Kernel Density Estimators; Parzen Windows


We can estimate a density function by employing a kernel function K:
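A common way to write this estimate, assuming n samples x_1, …, x_n in d dimensions and a window width (bandwidth) h, symbols that are introduced here for illustration, is:

```latex
\hat{p}(x) = \frac{1}{n\,h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```

Each sample contributes one copy of the kernel K, centered on the sample itself and scaled by h.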

Notice how the Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data
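A minimal one-dimensional sketch of such an estimator, assuming a Gaussian kernel and a hand-picked bandwidth `h`; both choices, the function names and the sample values are illustrative assumptions, not prescribed by the slides.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, one common choice for the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(samples, h, kernel=gaussian_kernel):
    """Return p(x): the average of kernels centered on the samples (1-D case)."""
    n = len(samples)

    def p(x):
        return sum(kernel((x - s) / h) for s in samples) / (n * h)

    return p

# Hypothetical sample; h controls how smooth the estimate is
samples = [0.1, 0.2, 0.25, 0.9, 1.1, 1.15, 1.2, 2.6]
p = kde(samples, h=0.3)
```

Unlike the histogram, the estimate is smooth and does not depend on a bin origin; the bandwidth `h` takes over the role of the bin width.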


Different Kernel Functions



Revisiting our Example



Pre-Processing



Preprocessing: Feature Extraction / Selection

Observation
• Features might be correlated
• Features might be useless for a clustering
• Features might even blur the cluster structure

Feature Selection
• Utilizes only a subset of the available features
• Most methods are coupled with a mining tool to determine optimality

Feature Extraction
• Creates new features out of the existing features
• Seeks to create uncorrelated, better features
• Examples: PCA, PCoA


Preprocessing: Normalization

Feature1: [0,1], Feature2: [1000,80000]

Normalization: Bring both features to [0,1]
=> Bad with outliers!

Standardization: The values are expressed as deviations from the mean, scaled by the standard deviation

Generally: Loss of scale and location!
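A minimal sketch of both rescalings, assuming the common definitions x' = (x − min) / (max − min) for normalization and z = (x − mean) / standard deviation for standardization. The function names and the example values are illustrative.

```python
import statistics

def min_max(values):
    """Rescale to [0, 1]; undefined if all values are identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical feature with a single outlier at 80000
feature2 = [1000, 2000, 3000, 4000, 80000]
scaled = min_max(feature2)
standardized = z_score(feature2)
```

In `scaled`, the outlier maps to 1 and all other values are squeezed into the interval below 0.04, which illustrates why min-max normalization behaves badly with outliers.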


PCA

PCA is a very complex and large topic which could fill an entire lecture series

Furthermore, there are many interpretations of and different applications for a PCA¹

Here, we limit ourselves to the usage of PCA in clustering:
• Project the data to a lower-dimensional space
• With the intention of simplifying the clustering
• Hopefully provides a better means for visual inspection

The task of a PCA is to perform a dimensionality reduction in such a way that most of the variance in the original data is preserved

¹ See for example: https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/


An Example



How does a PCA work?

The PCA performs a basis transformation in which the first basis vector is the direction accounting for most of the variance in the dataset, the second for most of the remaining variance, and so on.

These basis vectors can be found by the eigenvalue decomposition of the covariance matrix 𝑄 or the sample correlation matrix 𝑅.

The eigenvalues 𝜆1, … , 𝜆𝑑 indicate the variance of the data along the corresponding eigenvectors 𝑦1, … , 𝑦𝑑
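The procedure can be sketched for the special case of two features, where the eigen-decomposition of the 2 × 2 covariance matrix has a closed form and no linear algebra library is needed. This is an illustrative sketch only: the function names, the 1/(n − 1) normalization and the example points are assumptions, and real analyses would use a general eigen-solver.

```python
import math

def covariance_2d(points):
    """Mean and entries (sxx, sxy, syy) of the sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def pca_2d(points):
    """Eigenvalues (descending) and unit eigenvectors of the covariance matrix."""
    mean, (a, b, c) = covariance_2d(points)
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    eigenvalues = (half_trace + root, half_trace - root)
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # angle of the first principal axis
    y1 = (math.cos(theta), math.sin(theta))
    y2 = (-math.sin(theta), math.cos(theta))
    return mean, eigenvalues, (y1, y2)

def project(point, mean, axis):
    """Coordinate of a point along one principal axis."""
    return (point[0] - mean[0]) * axis[0] + (point[1] - mean[1]) * axis[1]

# Hypothetical data lying exactly on the line y = 2x
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, (lam1, lam2), (y1, y2) = pca_2d(points)
scores = [project(p, mean, y1) for p in points]
```

For these points all variance lies along the direction (1, 2), so the second eigenvalue is zero up to rounding and the one-dimensional `scores` preserve the whole structure of the data.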


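The Covariance Matrix

• The observed covariance of two features is computed from 𝑛 observations (𝑥𝑖, 𝑦𝑖): it averages the product of their deviations from the respective means
• The covariance matrix collects the covariances of all pairs of features, with the variances on the diagonal
• The covariance matrix generalizes the notion of variance to multiple dimensions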
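Written out, and assuming the unbiased 1/(n − 1) normalization (a 1/n normalization is equally common) as well as the names X_1, …, X_d for the d features, the observed covariance and the covariance matrix 𝑄 take the following form:

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
Q = \begin{pmatrix}
\operatorname{cov}(X_1, X_1) & \cdots & \operatorname{cov}(X_1, X_d) \\
\vdots & \ddots & \vdots \\
\operatorname{cov}(X_d, X_1) & \cdots & \operatorname{cov}(X_d, X_d)
\end{pmatrix}
```

The diagonal entries cov(X_j, X_j) are the ordinary variances of the individual features, and 𝑄 is symmetric.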


Example: PCA



Image taken from Ricardo Gutierrez-Osuna’s class on Pattern Analysis


Proximity Calculation



Different Proximity Measures

Similarity
• Numerical measure of how alike two data objects are
• Is higher when objects are more alike
• Often falls in the range [0, 1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Often called a distance if it fulfills the metric properties


One-mode / Two-mode

One-mode
• In a one-mode dataset, the data is given as an 𝑛 × 𝑛 matrix 𝑃 = (𝑝𝑖𝑗)
• 𝑝𝑖𝑗 relates each pair of objects 𝑥𝑖 and 𝑥𝑗 to each other
• Also often called a similarity/dissimilarity matrix
• Normally, a one-mode matrix is symmetric
• Called one-mode because columns and rows describe the same thing

Two-mode
• A two-mode dataset normally comes as an 𝑛 × 𝑑 matrix
• Each object is in a row, with each property stored in a different column
• Sometimes, this mode is called the raw data
• A row is also sometimes called a feature vector
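The relation between the two modes can be sketched as follows: a two-mode 𝑛 × 𝑑 matrix is turned into a one-mode 𝑛 × 𝑛 dissimilarity matrix by comparing every pair of rows. The Euclidean distance and the name `to_one_mode` are illustrative assumptions; any proximity measure could be plugged in.

```python
import math

def to_one_mode(rows):
    """Build the n x n matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical raw data: 3 objects (rows) with 2 features (columns)
raw = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
P = to_one_mode(raw)
```

Here `P[0][1]` is 5.0, the diagonal is zero, and `P` is symmetric, as expected for a one-mode dissimilarity matrix.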


Proximity Calculation: Continuous Data

Euclidean-type measures (Minkowski distance)

Image taken from wikipedia.com
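A short sketch of the Minkowski family, using the standard definition d_p(x, y) = (Σ |x_i − y_i|^p)^(1/p). The function name and the handling of p = ∞ as a special case are choices made for this illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p >= 1 between two equally long vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)  # limit case: maximum (Chebyshev) distance
    return sum(d ** p for d in diffs) ** (1.0 / p)

manhattan = minkowski((0, 0), (3, 4), 1)            # p = 1
euclidean = minkowski((0, 0), (3, 4), 2)            # p = 2
maximum = minkowski((0, 0), (3, 4), float("inf"))   # p = infinity
```

For the points (0, 0) and (3, 4) the three orders give 7, 5 and 4 respectively, which shows how the choice of p changes the notion of distance.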


Proximity Calculation: Continuous Data

Correlation Coefficient
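As an illustration, assuming the Pearson correlation coefficient is meant here, a similarity between two feature vectors can be computed as below. The function name and the example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator  # undefined for constant vectors

# Hypothetical expression profiles with the same shape but different scale
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]
profile_c = [4.0, 3.0, 2.0, 1.0]
r_ab = pearson(profile_a, profile_b)
r_ac = pearson(profile_a, profile_c)
```

In contrast to the Euclidean-type measures, the correlation ignores scale and offset: `profile_a` and `profile_b` are perfectly correlated although their values differ by a factor of ten.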


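Similarity Measures for Binary Variables

Most of the measures define a similarity based on the counts of matches and mismatches of two objects in the 𝑑 variables

In the usual 2 × 2 contingency table of two objects, 𝑎 counts the variables that are 1 in both objects, 𝑑 those that are 0 in both, and 𝑏 and 𝑐 those in which the objects differ

Generally speaking, the counts 𝑎 and 𝑑 can be seen as matches, the counts 𝑏 and 𝑐 as mismatches

While 𝑏 and 𝑐 can be seen as equivalent, this is certainly not true for the matching states 𝑎 and 𝑑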


Similarity Measures for Binary Variables

Matching coefficient

Jaccard coefficient

When the presence of a feature has the same “explanatory power” as its absence, the matching coefficient is applied; otherwise the Jaccard coefficient
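Both coefficients can be sketched as follows, assuming the usual convention that 𝑎 counts 1/1 matches, 𝑑 counts 0/0 matches, and 𝑏 and 𝑐 count the two kinds of mismatches, with matching coefficient (𝑎 + 𝑑) / (𝑎 + 𝑏 + 𝑐 + 𝑑) and Jaccard coefficient 𝑎 / (𝑎 + 𝑏 + 𝑐). Function names and example vectors are illustrative.

```python
def binary_counts(x, y):
    """Counts (a, b, c, d) of the 2 x 2 contingency table of two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def matching_coefficient(x, y):
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(x, y):
    a, b, c, _ = binary_counts(x, y)  # negative matches d are ignored
    if a + b + c == 0:
        return 0.0  # both vectors all zero: undefined, 0.0 is an assumption here
    return a / (a + b + c)

# Hypothetical presence/absence vectors of two objects
x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
```

For these vectors the counts are 𝑎 = 1, 𝑏 = 1, 𝑐 = 1, 𝑑 = 3: the matching coefficient is 4/6 because the three shared absences count as agreement, while the Jaccard coefficient is only 1/3.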


Similarity Measures for Categorical Data

A straightforward way would be to treat each level of the categorical variable as its own binary variable and apply the known measures
• Let’s say the variable eye-color ∈ {blue, brown, green, gray}
• It can be converted into the binary variables “has blue eyes”, “has brown eyes”, ...
• Problem: By default, many negative matches

Therefore: Often one simply counts on how many variables two objects agree
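A small sketch of both ideas, with hypothetical helper names and made-up example objects: `one_hot` performs the conversion into binary variables, and `agreement` counts on how many categorical variables two objects agree.

```python
def one_hot(value, levels):
    """Encode one categorical value as binary 'has level' variables."""
    return [1 if value == level else 0 for level in levels]

def agreement(x, y):
    """Fraction of categorical variables on which two objects agree."""
    return sum(1 for u, v in zip(x, y) if u == v) / len(x)

levels = ["blue", "brown", "green", "gray"]
blue, brown = one_hot("blue", levels), one_hot("brown", levels)
# 0/0 pairs: levels that neither object has
negative_matches = sum(1 for u, v in zip(blue, brown) if u == 0 and v == 0)

# Hypothetical objects described by three categorical variables
object_a = ("blue", "female", "smoker")
object_b = ("brown", "female", "smoker")
similarity = agreement(object_a, object_b)
```

Although blue and brown eyes disagree, their binary encodings share two negative matches out of four variables, which is exactly the problem named above. Counting agreements on the original variables avoids it and gives 2/3 for the two example objects.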


Proximity Calculation: Specialized Functions

These standard methods are often not sufficient for biological data, as we have neither categorical data nor an embedding in an n-dimensional space
• How to embed a sequence? A network? A protein structure?

Specialized measures:
• BLAST
• Network edit distance
• Protein structure alignments


Clustering



From a Criterion to an Algorithm

Each clustering tool optimizes some inherent idea of a perfect clustering

They are all only approximations!

Possibilities to separate 𝑛 objects into 𝑘 clusters:
• 𝑁(5, 2) = 15
• 𝑁(10, 3) = 9330
• 𝑁(50, 4) ≈ 5.3 ⋅ 10^28
• 𝑁(100, 5) ≈ 6.6 ⋅ 10^67

There are an estimated 4 ⋅ 10^(80±1) atoms in the observable universe

It is important to know what exactly the clustering algorithm optimizes!
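The counts above are Stirling numbers of the second kind, i.e. the number of ways to partition 𝑛 objects into 𝑘 non-empty, unlabeled clusters. A minimal sketch using the standard recurrence S(n, k) = k · S(n − 1, k) + S(n − 1, k − 1); the function name `stirling2` is an illustrative choice.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of n objects into k non-empty clusters."""
    if n == k:
        return 1            # includes the empty partition for n = k = 0
    if k == 0 or k > n:
        return 0
    # the n-th object joins one of the k existing clusters, or opens a new one
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

small = stirling2(5, 2)
medium = stirling2(10, 3)
large = stirling2(100, 5)
```

The recurrence reproduces the two small values exactly, and the growth of `large` makes clear why no tool can enumerate all clusterings and why every tool has to settle for an approximation.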


Tool Selection: Overview

• k-Means based
• Hierarchical
• Graph-based
• Density-based
• Others


Tool Selection: k-Means based

• Most popular clustering tool
• Two-step iterative process:
  • Assign objects to the closest centers
  • Update these centers
• Good time complexity (almost linear)
• Minimizes the mean squared error of the objects to the cluster centers
• Works quite well in practice
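The two-step iteration can be sketched as below. This is a minimal illustration of the basic scheme with random initial centers, not a tuned implementation; the function names, the stopping rule and the example points are assumptions.

```python
import math
import random

def assign(points, centers):
    """Step 1: index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def update(points, labels, centers):
    """Step 2: move every center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)  # keep the center of an empty cluster
    return new_centers

def k_means(points, k, max_iterations=100, seed=0):
    centers = random.Random(seed).sample(points, k)  # random initial centers
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return assign(points, centers), centers

# Hypothetical data: two well-separated groups of three points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
```

Each pass over the data costs on the order of n · k distance computations, which is where the almost linear running time comes from.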


Tool Selection: Problems with k-means

Sensitive to initialization: how do we choose the initial partitions?
• Run the algorithm several times with different initializations
• (Subset) furthest-first initialization
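Furthest-first initialization can be sketched as follows: start with one random point and repeatedly add the point that is furthest from all centers chosen so far. The function name and example data are illustrative; the subset variant is assumed here to apply the same procedure to a random sample of the points.

```python
import math
import random

def furthest_first(points, k, seed=0):
    """Pick k initial centers that are spread out over the data."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: a random point
    while len(centers) < k:
        # next center: the point with the largest distance to its nearest center
        candidate = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(candidate)
    return centers

# Hypothetical data: two well-separated groups of three points each
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = furthest_first(points, k=2)
```

For the example data the two centers always end up in different groups, whichever point is drawn first. Because the criterion favors extreme points, it is also attracted by outliers.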




k-Means prefers hyperspherical clusters of approximately the same size

Image taken from wikipedia.com


How to find the best 𝑘?

• No easy answer to that
• Employ domain knowledge
• Use internal cluster validity indices
• Use the gap statistic


Tool Selection: Hierarchical

• Creates a hierarchical embedding of the clustering
• Two main branches:
  • Agglomerative
  • Divisive

Image: Brazma, Alvis, and Jaak Vilo. "Gene expression data analysis." FEBS letters 480.1 (2000): 17-24.


Tool Selection: Single Linkage


The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.


Tool Selection: Complete Linkage


The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters


Tool Selection: Average Linkage

The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters

• Compromise between single and complete linkage
• Strengths: Less susceptible to noise and outliers
• Limitations: Biased towards spherical clusters
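The three linkage definitions differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, assuming Euclidean distances between points; the function names and the example clusters are illustrative.

```python
import math
from itertools import product

def pair_distances(cluster_a, cluster_b):
    """Distances of all pairs with one object from each cluster."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))   # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))   # farthest pair

def average_linkage(cluster_a, cluster_b):
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)             # mean over all pairs

# Hypothetical clusters on a line
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
```

For the two example clusters the pairwise distances are 3, 5, 2 and 4, so single linkage gives 2, complete linkage 5 and average linkage 3.5. An agglomerative method repeatedly merges the two clusters for which the chosen linkage value is smallest.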



• k-Means based
• Hierarchical
• Graph-based
  • Represent the data as a graph
  • Identify densely connected areas in the graph
  • Examples: MCL, Transitivity Clustering, Affinity Propagation
  • Used for: network and complex analysis
• Density-based
• …

Images: Vlasblom, James, and Shoshana J. Wodak. "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs." BMC bioinformatics 10.1 (2009): 1.



• k-Means based
• Hierarchical
• Graph-based
• Density-based
  • Separates high-density areas from low-density areas
  • Very efficient
  • Arbitrary cluster shapes
  • Requires an embedding of the objects
• …


Cluster Evaluation



Evaluate a Clustering

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

No puppets were harmed in the production of this lecture; generally, the usage of black magic is kept to a minimum at SDU.


Overview of Cluster Validation

Two different kinds of measures can be distinguished:

External Measures
• Compare two clusterings
• Use a gold standard to evaluate the quality of a clustering

Internal Measures
• Only use the clustering itself as the basis for evaluation
• Comparable to cluster criteria


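External Measures

We can look at each pair of points and define TP, FP, FN and TN by whether the pair is placed together in the clustering and in the gold standard

Or map each cluster 𝑐𝑗 to the gold standard cluster 𝑘𝑖 with the highest overlap and count each object 𝑎 as:
• TP if 𝑎 ∈ 𝑘𝑖 ∧ 𝑎 ∈ 𝑐𝑗
• FP if 𝑎 ∉ 𝑘𝑖 ∧ 𝑎 ∈ 𝑐𝑗
• FN if 𝑎 ∈ 𝑘𝑖 ∧ 𝑎 ∉ 𝑐𝑗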


Now we can define Measures

• Rand Index (pair-wise)
• Jaccard Index (pair-wise)
• F-measure (mapping)
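The two pair-wise measures can be sketched as below, assuming the standard pair-counting definitions: a pair is a TP if it is together in both the clustering and the gold standard, an FP if it is together only in the clustering, an FN if it is together only in the gold standard, and a TN otherwise. Then Rand = (TP + TN) / all pairs and Jaccard = TP / (TP + FP + FN). Function names and label vectors are illustrative; the mapping-based F-measure is not covered by this sketch.

```python
from itertools import combinations

def pair_counts(clustering, gold):
    """Count (TP, FP, FN, TN) over all pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        together = clustering[i] == clustering[j]
        together_gold = gold[i] == gold[j]
        if together and together_gold:
            tp += 1
        elif together:
            fp += 1
        elif together_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(clustering, gold):
    tp, fp, fn, tn = pair_counts(clustering, gold)
    return (tp + tn) / (tp + fp + fn + tn)

def jaccard_index(clustering, gold):
    tp, fp, fn, _ = pair_counts(clustering, gold)  # true negatives are ignored
    return tp / (tp + fp + fn)

# Hypothetical cluster labels of six objects; object 2 is misplaced
gold = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]
```

For this example the 15 pairs split into TP = 4, FP = 3, FN = 2 and TN = 6, giving a Rand Index of 10/15 and a Jaccard Index of 4/9. The Rand Index is higher because it also rewards the many pairs that are correctly kept apart.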


Internal Measures

• Have no additional information about the ground truth at their disposal
• Similar to cluster criteria
• Normally based on:
  • Compactness: measures how closely related the objects in a cluster are
  • Separation: measures how distinct or well-separated a cluster is from the other clusters


Internal Measures

Dunn Index
• The Dunn Index assesses the clustering performance by relating the maximal cluster diameter to the minimal distance between clusters
• This measure is sensitive to outliers, as it is based on minimal and maximal distances

Davies-Bouldin Index
• The Davies-Bouldin Index DB is defined based on the average distances between objects and their cluster centroids
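A minimal sketch of the Dunn Index in one of its common variants: the smallest distance between objects of different clusters, divided by the largest distance between objects of the same cluster (the diameter). Other definitions of inter-cluster distance and diameter exist; the function name and example clusters are illustrative.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Minimal inter-cluster distance divided by maximal cluster diameter.

    Needs at least two clusters and at least one cluster with two objects.
    """
    min_separation = min(
        math.dist(a, b)
        for cluster_a, cluster_b in combinations(clusters, 2)
        for a in cluster_a
        for b in cluster_b
    )
    max_diameter = max(
        math.dist(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )
    return min_separation / max_diameter

# Hypothetical clustering of four points on a line
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(4.0, 0.0), (6.0, 0.0)]]
score = dunn_index(clusters)
```

Here the closest objects of different clusters are 3 apart and the widest cluster has diameter 2, so the index is 1.5; larger values indicate compact, well-separated clusters. A single outlier changes either the minimum or the maximum, which is why the index reacts so strongly to outliers.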


Silhouette Coefficient

Based on:
• Cohesion 𝑎(𝑥): average within-cluster distance of 𝑥
• Separation 𝑏(𝑥): average distance of 𝑥 to the closest other cluster

Takes values between -1 and 1
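The two quantities are commonly combined as s(𝑥) = (𝑏(𝑥) − 𝑎(𝑥)) / max(𝑎(𝑥), 𝑏(𝑥)). A minimal sketch under that assumption; the function names, the value 0 for singleton clusters and the example clustering are illustrative choices.

```python
import math

def mean_distance(x, points):
    return sum(math.dist(x, p) for p in points) / len(points)

def silhouette(clusters, c, i):
    """Silhouette value of the i-th object of cluster c."""
    x = clusters[c][i]
    rest = clusters[c][:i] + clusters[c][i + 1:]
    if not rest:
        return 0.0  # singleton cluster: 0 by convention (an assumption here)
    a = mean_distance(x, rest)                          # cohesion a(x)
    b = min(mean_distance(x, other)                     # separation b(x)
            for j, other in enumerate(clusters) if j != c)
    return (b - a) / max(a, b)

def mean_silhouette(clusters):
    """Average silhouette value over all objects."""
    values = [silhouette(clusters, c, i)
              for c, cluster in enumerate(clusters)
              for i in range(len(cluster))]
    return sum(values) / len(values)

# Hypothetical clustering of four points on a line
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(4.0, 0.0), (6.0, 0.0)]]
s_first = silhouette(clusters, 0, 0)
```

For the first object, 𝑎 = 1 and 𝑏 = 5, so its silhouette value is 0.8. Values close to 1 mean the object sits well inside its cluster, values near 0 indicate a border case, and negative values suggest that the object fits better into a neighboring cluster.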


What to do?

A single best method does not exist for all use cases
• How to proceed then?
• Is there a general rule we could follow?

ClustEval: Fully automates the clustering
• We tested 13 clustering methods
• On 24 datasets (12 real-world, 12 artificial)
• With 13 common validity measures
• 1000 parameter sets per tool per dataset


Results





Results of ClustEval



There is no general best performer among the tools

Quite often internal and external measures do not agree on the performance assessment

When using only biomedical datasets, the Silhouette Value has the best agreement with external measures



Workshop Introduction



BreathOMICS data



What is it good for?

[Figure: 3-Pentanone monomer & dimer; axes: “Averaged Y” and “time scale individually normalized / a.u.”; legend: SHAM S, CLI S. © J.I. Baumbach - B&S Analytik, Dortmund, Germany]


Data Preprocessing

[Figure panels: raw, smoothed, de-noised]


Peak Detection

- Local maxima search (LMS)
- Merged peak cluster localization (MPCL), Bader et al. 2008
- Wavelet-based multi-scale peak detection, Bader et al. 2008
- Watershed transformation (WST), Bunkowski et al. 2011
- Peak model estimation (PME), Kopczynski et al. 2012


BreathOMICS

[Figure: data matrix of patients versus substances]


Peak Alignment






Thank you for your Attention

Q & A

Contact:[email protected]
