Clustering of Biomedical Data

Richard Röttger

CBioVikings, Copenhagen, February 2nd, 2017


Who is talking?

Resources

Go to http://imada.sdu.dk/~roettger/teaching/cbiovikings.php

You will find:

• The dataset

• These slides

• An overview paper

• A small R script for a cluster analysis

• An R tutorial

R. Röttger. Clustering of Biological Datasets in the Era of Big Data. Journal of Integrative Bioinformatics 13 (1), 300

Clustering in Life Sciences

Long-standing problem in computer science: “grouping or segmenting a collection of objects into subsets or ‘clusters’, such that those within each cluster are more closely related to one another than objects assigned to different clusters.”

Applied in almost every scientific field, e.g.:

• Information retrieval

• Economics and marketing

• Astronomy

• Bioinformatics

In bioinformatics:

• Homology detection

• Gene expression studies

• Protein complex prediction

Image taken from: http://clusteval.sdu.dk/

Complexity of Clustering

Most Pressing Issues:

• What tool to use?

• How to find a best clustering?

• How to tune the parameters of a tool?

• How to do this in a reliable and reproducible manner?

Graphical Analysis

First, let’s have a look!

Good way to gain an overview

Histograms and Scatterplots

Can be misleading

Hard to automate

Scatterplots

How many clusters do you see?

This is so-called overplotting.

Only meaningful for bivariate data

Density Estimation

We have seen that we have a couple of problems:

• Overplotting

• A wrong bin size can easily hide interesting features

Now, let’s consider a different approach:

• Assume that our dataset originates from some probability density function

• If we knew the type and specifics of this density function, we would have all the information we need for a clustering

• BUT: We do not have this information!

• And we do not want to make any assumptions about it (this is the so-called non-parametric density estimation)

Histogram as a Density Estimate

Divide the sample space into a number of bins

Approximate the density at the center of each bin by counting
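The two steps above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: one-dimensional data, equal-width bins over a fixed range, and the hypothetical function name `histogram_density`.

```python
import numpy as np

def histogram_density(samples, n_bins, low, high):
    """Estimate a 1-D density by counting samples per equal-width bin."""
    edges = np.linspace(low, high, n_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    width = (high - low) / n_bins
    # Dividing by n * width makes the bars integrate to (at most) 1.
    density = counts / (len(samples) * width)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, density

# Toy data: two groups of points around -2 and +2 (made up for illustration).
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
centers, density = histogram_density(samples, n_bins=20, low=-5, high=5)
```

Changing `n_bins` or shifting `low` and `high` changes the estimate noticeably, which is a quick way to see the bin-size problem on real data.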

Drawbacks of a Histogram

The density estimate depends on the starting position of the bins

The discontinuities of the estimate are caused by the bin boundaries, not by the underlying density

Curse of dimensionality: the number of bins grows exponentially with the number of dimensions

• In high dimensions, many examples are needed in order to have non-empty bins

Therefore: unsuitable for high dimensions

More sophisticated density estimators are required

Kernel Density Estimators; Parzen Windows

We can estimate a density function by employing a kernel function K:

Notice how the Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data
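The formula from the slide is not preserved in this transcript. A common textbook form of the Parzen window estimate in one dimension is p(x) = 1/(n·h) · Σ K((x − x_i)/h), with n samples x_i and a bandwidth h. The sketch below assumes exactly that form together with a Gaussian kernel; the names `gaussian_kernel` and `parzen_density` and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def parzen_density(x, samples, h):
    """Evaluate the 1-D kernel density estimate at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    # One row per query point, one column per sample.
    u = (x[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

samples = np.array([-2.1, -1.9, -2.0, 1.8, 2.0, 2.2])
grid = np.linspace(-6.0, 6.0, 1201)
density = parzen_density(grid, samples, h=0.5)
```

Every sample contributes one smooth bump centered on itself, so there are no bin positions to choose; the bandwidth `h` plays the role of the bin width.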

Different Kernel Functions

Revisiting our Example

Pre-Processing

Preprocessing: Feature Extraction / Selection

Observation

• Features might be correlated

• Features might be useless for a clustering

• Features might even blur the cluster structure

Feature Selection

• Utilizes only a subset of the available features

• Most methods are coupled with a mining tool to determine optimality

Feature Extraction

• Creates new features out of the existing features

• Seeks to create uncorrelated, better features

• Examples: PCA, PCoA

Preprocessing: Normalization

Feature 1: [0, 1]

Feature 2: [1000, 80000]

Normalization: bring both features to [0, 1]

=> Bad with outliers!

Standardization: the values are centered on the mean and scaled by the standard deviation: z = (x − μ) / σ

Generally: loss of scale and location!
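Both rescalings can be sketched with NumPy as follows. The function names and the toy matrix are made up for illustration, and the sketch assumes that no feature is constant (otherwise the divisions below would be by zero).

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every column (feature) linearly to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def standardize(X):
    """Center every column on its mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Second feature contains one outlier (80000).
X = np.array([[0.1, 1000.0], [0.5, 2000.0], [0.9, 3000.0], [0.4, 80000.0]])
X_minmax = min_max_normalize(X)
X_std = standardize(X)
```

In `X_minmax` the three regular values of the second feature end up squeezed into a tiny interval close to 0, which illustrates why min-max normalization behaves badly with outliers.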

PCA

PCA is a very complex and large topic which could fill an entire lecture series

Furthermore, there are many interpretations and different applications of PCA [1]

Here, we limit ourselves to the usage of PCA in clustering:

• Project the data to a lower-dimensional space

• With the intention of simplifying the clustering

• Hopefully provides a better means for visual inspection

The task of a PCA is to perform a dimensionality reduction in such a way that most of the variance in the original data is preserved

[1] See, for example: https://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/

An Example

How does a PCA work?

The PCA performs a basis transformation in which the first basis vector accounts for most of the variance in the dataset, the second for most of the remaining variance, and so on ...

These basis vectors can be found by the eigenvalue decomposition of the covariance matrix 𝑄 or the sample correlation matrix 𝑅.

The eigenvalues 𝜆1, … , 𝜆𝑑 indicate the variance of the data along the corresponding eigenvectors 𝑦1, … , 𝑦𝑑
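A minimal sketch of this procedure with NumPy, assuming the covariance matrix (not the correlation matrix) is decomposed. The function name `pca` and the toy data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the leading eigenvectors of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    Q = np.cov(centered, rowvar=False)             # d x d covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(Q)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]          # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    projected = centered @ eigenvectors[:, :n_components]
    return projected, eigenvalues, eigenvectors

# Toy data: two strongly correlated features.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + rng.normal(scale=0.1, size=200)])
projected, eigenvalues, eigenvectors = pca(X, n_components=1)
```

The ratio `eigenvalues[0] / eigenvalues.sum()` gives the share of the total variance that survives the projection onto the first component.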

The Co-Variance Matrix

The co-variance is defined as

This is the observed covariance for 𝑛 observations 𝑥𝑖 , 𝑦𝑖

The co-variance matrix is then defined as

The covariance matrix generalizes the notion of variance to multiple dimensions
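The formulas themselves are missing from this transcript. The sketch below assumes the usual sample covariance with the 1/(n − 1) normalization; the slide may have used 1/n instead. The function names and the data are hypothetical.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance of two equally long observation vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: normalization by n - 1 (unbiased sample covariance).
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def covariance_matrix(X):
    """d x d matrix whose entry (j, k) is the covariance of columns j and k."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    return np.array([[covariance(X[:, j], X[:, k]) for k in range(d)]
                     for j in range(d)])

X = np.array([[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]])
Q = covariance_matrix(X)
```

The diagonal of `Q` holds the variances of the individual features, and the matrix is symmetric because the covariance of x and y equals the covariance of y and x.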

Example: PCA


Image taken from Ricardo Gutierrez-Osuna’s class on Pattern Analysis

Proximity Calculation

Different Proximity Measures

Similarity

• Numerical measure of how alike two data objects are

• Higher when objects are more alike

• Often falls in the range [0, 1]

Dissimilarity

• Numerical measure of how different two data objects are

• Lower when objects are more alike

• Minimum dissimilarity is often 0

• Upper limit varies

• Often called a distance if it fulfills the metric properties

One-mode / Two-mode

One-mode

• In a one-mode dataset, the data is given as an 𝑛 × 𝑛 matrix 𝑃 = (𝑝𝑖𝑗)

• 𝑝𝑖𝑗 relates the objects 𝑥𝑖 and 𝑥𝑗 of each pair to each other

• Often also called a similarity/dissimilarity matrix

• Normally, a one-mode matrix is symmetric

• Called one-mode because columns and rows describe the same thing

Two-mode

• A two-mode dataset normally comes as an 𝑛 × 𝑑 matrix

• Each object is a row, with each property stored in a different column

• This form is sometimes called the raw data

• A row is sometimes also called a feature vector

Proximity Calculation: Continuous Data

Euclidean Type of measures (Minkowski Distance)

Image taken from wikipedia.com
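The Minkowski distance of order p between two points x and y is (Σ |x_i − y_i|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. A short sketch, with a hypothetical function name and example points:

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(x, y, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(x, y, p=2)  # sqrt(9 + 16) = 5
```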

Proximity Calculation: Continuous Data

Correlation Coefficient
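The slide does not say which correlation coefficient is meant. The sketch below assumes the Pearson correlation and turns it into a dissimilarity via 1 − r, which is one common convention and not necessarily the one used in the lecture. All names and profiles are hypothetical.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def correlation_distance(x, y):
    """Dissimilarity in [0, 2]: 0 for perfectly correlated profiles."""
    return 1.0 - pearson_correlation(x, y)

profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]  # same shape, different scale
profile_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend
```

Unlike the Euclidean distance, this measure treats `profile_a` and `profile_b` as identical, which is often what is wanted for expression profiles that differ only in magnitude.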

Similarity Measures for Binary Variables

Most of the measures define the similarity based on counts of the matches and mismatches of two objects in the 𝑑 variables

Generally speaking, the counts 𝑎 and 𝑑 can be seen as matches, the counts 𝑏 and 𝑐 as mismatches

While 𝑏 and 𝑐 can be seen as equivalent, this is certainly not true for the matching states 𝑎 and 𝑑

Similarity Measures for Binary Variables

Matching Coefficient

Jaccard coefficient

When the presence of a feature has the same “explanatory power” as the absence, the Matching Coefficient is applied, otherwise the Jaccard coefficient
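The formulas are not preserved in this transcript. The sketch below assumes the usual convention that a counts the variables where both objects are 1, d those where both are 0, and b and c the two kinds of mismatch; the matching coefficient is then (a + d) / (a + b + c + d) and the Jaccard coefficient a / (a + b + c). Function names and vectors are hypothetical, and the Jaccard function assumes that at least one object has a feature present.

```python
def binary_counts(x, y):
    """Counts (a, b, c, d) of the four possible states of two binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both present
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # mismatch
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # mismatch
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # both absent
    return a, b, c, d

def matching_coefficient(x, y):
    """Share of variables on which both objects agree (1-1 and 0-0)."""
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(x, y):
    """Like the matching coefficient, but ignoring joint absences."""
    a, b, c, _ = binary_counts(x, y)
    return a / (a + b + c)

x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
```

For these two vectors the matching coefficient is 4/6 while the Jaccard coefficient is only 1/3, because three of the four agreements are joint absences.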

Similarity Measures for Categorical Data

A straightforward way would be to treat each level of the categorical variable as its own binary variable and apply the known measures

• Let’s say the variable eye-color ∈ {blue, brown, green, gray}

• It can be converted into the binary variables “has blue eyes”, “has brown eyes”, ...

• Problem: by default, many negative matches

Therefore: it is often counted how often two objects agree on the different variables
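Counting agreements directly can be sketched as below. Normalizing by the number of variables is an assumption made here so that the result lies in [0, 1]; the function name and the example variables (other than eye color) are made up.

```python
def categorical_similarity(x, y):
    """Fraction of categorical variables on which two objects agree."""
    if len(x) != len(y):
        raise ValueError("both objects need the same number of variables")
    return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

# Variables: eye color, sex, smoking status (hypothetical example).
person_a = ["blue", "female", "smoker"]
person_b = ["blue", "male", "smoker"]
similarity = categorical_similarity(person_a, person_b)
```

Each variable contributes exactly one comparison, so the many negative matches produced by a one-binary-variable-per-level encoding do not arise.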

Proximity Calculation: Specialized Functions

These standard methods are often not sufficient for biological data, as we have neither categorical data nor an embedding in an n-dimensional space

How to embed a sequence? A network? A protein structure?

Specialized measures:

• BLAST

• Network edit distance

• Protein structure alignments

Clustering

From a Criterion to an Algorithm

Each clustering tool optimizes some inherent idea of a perfect clustering

They are all only approximations!

Number of possibilities 𝑁(𝑛, 𝑘) to separate 𝑛 objects into 𝑘 non-empty clusters:

𝑁(5, 2) = 15

𝑁(10, 3) = 9330

𝑁(50, 4) ≈ 5.3 ⋅ 10^28

𝑁(100, 5) ≈ 6.6 ⋅ 10^67

For comparison: there are an estimated 4 ⋅ 10^(80±1) atoms in the observable universe

It is important to know what exactly the clustering algorithm optimizes!
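The counts quoted above match the Stirling numbers of the second kind, which count the ways to split n objects into k non-empty groups (for example, 15 ways for five objects and two clusters). A short sketch using the standard recurrence; the function name is hypothetical.

```python
def count_partitions(n, k):
    """Number of ways to split n objects into k non-empty clusters."""
    # table[i][j] holds the count for i objects and j clusters.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Object i either joins one of the j existing clusters
            # or opens a new cluster on its own.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

small = count_partitions(10, 3)
large = count_partitions(100, 5)
```

Python integers have arbitrary precision, so even `count_partitions(100, 5)` is computed exactly; enumerating that many clusterings is of course out of the question, which is why every tool has to rely on a heuristic.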

Tool Selection: Overview

k-Means based

Hierarchical

Graph-based

Density-based

Others

Tool Selection: k-Means based

Most popular clustering tool

Two-step iterative process:

• Assign the objects to the closest centers

• Update these centers

Good time complexity (almost linear)

Minimizes the mean squared error of the objects to the cluster centers

Works quite well in practice
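The two-step iteration can be sketched as follows. This is a bare-bones illustration and not the implementation of any particular tool; it assumes Euclidean distances and a random choice of k objects as initial centers, and all names and data are hypothetical.

```python
import numpy as np

def k_means(X, k, n_iterations=100, seed=0):
    """Plain k-means: alternate between assignment and center update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct objects chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iterations):
        # Step 1: assign every object to its closest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move every center to the mean of its objects
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])
labels, centers = k_means(X, k=2)
```

Each iteration touches every object once per center, which is where the almost linear running time comes from.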

Tool Selection: Problems with k-means

Sensitive to initialization: how do we choose the initial partitions?

Run the algorithm several times with different initializations

(Subset) Furthest-first initialization
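Furthest-first initialization can be sketched as below: start with one object and repeatedly add the object that is farthest from all centers chosen so far. Starting from the first object is an arbitrary assumption; the "subset" variant would apply the same procedure to a random sample of the data. Names and data are hypothetical.

```python
import numpy as np

def furthest_first_centers(X, k, first_index=0):
    """Pick k initial centers that are spread far apart from each other."""
    X = np.asarray(X, dtype=float)
    chosen = [first_index]
    # Distance of every object to its nearest already chosen center.
    nearest = np.linalg.norm(X - X[first_index], axis=1)
    while len(chosen) < k:
        next_index = int(nearest.argmax())
        chosen.append(next_index)
        nearest = np.minimum(nearest,
                             np.linalg.norm(X - X[next_index], axis=1))
    return X[chosen]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 9.0]])
centers = furthest_first_centers(X, k=3)
```

Because the farthest object is always picked, a single outlier is very likely to become a center, which is one motivation for running the procedure on a subset only.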

Tool Selection: Problems with k-means

k-Means prefers hyperspherical clusters of approximately the same size

Image taken from wikipedia.com

How to find the best 𝑘?

No easy answer to that

Employ domain knowledge

Use internal cluster validity indices

Use the gap statistic

Tool Selection: Hierarchical

Creates a hierarchical embedding of the clustering

Two main branches:

• Agglomerative

• Divisive

Image: Brazma, Alvis, and Jaak Vilo. "Gene expression data analysis." FEBS letters 480.1 (2000): 17-24.

Tool Selection: Single Linkage

The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.

Tool Selection: Complete Linkage

The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters

Tool Selection: Average Linkage

The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters

Compromise between Single and Complete Link

Strengths: less susceptible to noise and outliers

Limitations: biased towards spherical clusters
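The three linkage criteria differ only in how the pairwise distances between two clusters are summarized. A small sketch, assuming Euclidean distances; the function names and the two toy clusters are hypothetical.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distances between all objects of cluster A and cluster B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    """Distance of the closest pair of objects from different clusters."""
    return float(pairwise_distances(A, B).min())

def complete_linkage(A, B):
    """Distance of the farthest pair of objects from different clusters."""
    return float(pairwise_distances(A, B).max())

def average_linkage(A, B):
    """Average distance over all pairs of objects from different clusters."""
    return float(pairwise_distances(A, B).mean())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])
```

For these clusters the pairwise distances are 3, 6, 2 and 5, so single linkage yields 2, complete linkage 6 and average linkage 4. An agglomerative algorithm would merge the two clusters with the smallest such value in every step.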

Tool Selection: Overview

k-Means based

Hierarchical

Graph-based

• Represent the data as a graph

• Identify densely connected areas in the graph

• Examples: MCL, Transitivity Clustering, Affinity Propagation

• Used for: network and complex analysis

Density-based

…

Images: Vlasblom, James, and Shoshana J. Wodak. "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs." BMC bioinformatics 10.1 (2009): 1.

Tool Selection: Overview

k-Means based

Hierarchical

Graph-based

Density-based

• Separate high-density areas from low-density areas

• Very efficient

• Arbitrary cluster shapes

• Require an embedding of the objects

Cluster Evaluation

Evaluate a Clustering

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”

Jain and Dubes, Algorithms for Clustering Data

No puppets were harmed in the production of this lecture; generally, the usage of black magic is limited to a minimum at SDU.

Overview of Cluster Validation

Two different kinds of measures can be distinguished

External Measures

Compare two clusterings

Use a gold-standard to evaluate the quality of a clustering

Internal Measures

Only use the clustering as basis for evaluation

Comparable to cluster criteria

External Measures

We can look at each pair of points and define

Or map each cluster 𝑐𝑗 to the gold standard cluster 𝑘𝑖 with the highest overlap

TP if 𝑎 ∈ 𝑘𝑖 ∧ 𝑎 ∈ 𝑐𝑗

FP if 𝑎 ∉ 𝑘𝑖 ∧ 𝑎 ∈ 𝑐𝑗

FN if 𝑎 ∈ 𝑘𝑖 ∧ 𝑎 ∉ 𝑐𝑗

Now we can define Measures

Rand Index (pair-wise)

Jaccard Index (pair-wise)

F-measure (mapping)
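The formulas are not preserved in this transcript. For the two pair-wise measures, the sketch below assumes the common definitions: a pair of objects is a true positive if it shares a cluster in both the clustering and the gold standard, a true negative if it is separated in both, and a false positive or false negative otherwise. The Rand index is then (TP + TN) / (all pairs) and the Jaccard index TP / (TP + FP + FN). Function names and label vectors are hypothetical.

```python
from itertools import combinations

def pair_counts(predicted, gold):
    """Count object pairs as (TP, FP, FN, TN) with respect to the gold standard."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_predicted and same_gold:
            tp += 1
        elif same_predicted:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(predicted, gold):
    """Share of object pairs on which both clusterings agree."""
    tp, fp, fn, tn = pair_counts(predicted, gold)
    return (tp + tn) / (tp + fp + fn + tn)

def jaccard_index(predicted, gold):
    """Like the Rand index, but ignoring pairs separated in both clusterings."""
    tp, fp, fn, _ = pair_counts(predicted, gold)
    return tp / (tp + fp + fn)

predicted = [0, 0, 0, 1, 1, 1]
gold = [0, 0, 1, 1, 1, 1]
```

Only the grouping matters here, not the label values themselves, so swapping the labels 0 and 1 in `predicted` leaves both indices unchanged.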

Internal Measures

No additional information about the ground truth is available

Similar to cluster criteria

Normally based on:

Compactness: this measures how closely related the objects in a cluster are

Separation: this measures how distinct or well-separated a cluster is from other clusters

Internal Measures

Dunn Index

• Assesses the clustering by relating the maximal cluster diameter to the minimal distance between clusters

• Prone to outliers, because it is based on minimal and maximal distances

Davies-Bouldin Index

• Defined based on the average distances between objects and their cluster centroids

Silhouette Coefficient

Based on:

• Cohesion 𝑎(𝑥): average within-cluster distance of 𝑥

• Separation 𝑏(𝑥): average distance of 𝑥 to the closest other cluster

Takes values between -1 and 1
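The formula is missing from this transcript. The sketch below assumes the usual definition s(x) = (b(x) − a(x)) / max(a(x), b(x)), Euclidean distances, at least two clusters, and the convention that objects in singleton clusters get the value 0. Names and data are hypothetical.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value of every object for a given clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # convention: singleton clusters get the value 0
        # Cohesion a(x): mean distance to the other members of the own cluster.
        a = distances[i, own].sum() / (own.sum() - 1)
        # Separation b(x): smallest mean distance to any other cluster.
        b = min(distances[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
```

Averaging `s` over all objects gives a single score for the clustering; values close to 1 indicate compact, well-separated clusters, values below 0 indicate objects that sit closer to a foreign cluster than to their own.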

What to do?

A single best method does not exist for all use cases

• How to proceed then?

• Is there a general rule we could follow?

ClustEval: fully automates the clustering

• We tested 13 clustering methods

• On 24 datasets (12 real-world, 12 artificial)

• With 13 common validity measures

• 1000 parameter sets per tool per dataset

Results

http://clusteval.sdu.dk/

Results of ClustEval

http://clusteval.sdu.dk/

There is no general best performer among the tools

Quite often internal and external measures do not agree on the performance assessment

When using only biomedical datasets, the Silhouette Value has the best agreement with external measures

Workshop Introduction

BreathOMICS data

What is it good for?

[Figure: averaged signal of the 3-pentanone monomer and dimer for the groups SHAM S and CLI S; x-axis: time scale, individually normalized (a.u.); y-axis: averaged Y. © J.I. Baumbach - B&S Analytik, Dortmund, Germany]

Data Preprocessing

[Figure panels: raw, smoothed, de-noised]

Peak Detection

- Local maxima search (LMS)

- Merged peak cluster localization (MPCL) Bader et al. 2008

- Wavelet-based multi-scale peak detection Bader et al. 2008

- Watershed transformation (WST) Bunkowski et al. 2011

- Peak model estimation (PME) Kopczynski et al. 2012
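To give an idea of the simplest entry in this list, the sketch below shows a local maxima search on a one-dimensional signal. It is a deliberately simplified illustration and not the implementation used for the BreathOMICS data; the threshold parameter, the function name and the signal are hypothetical.

```python
def local_maxima(signal, threshold=0.0):
    """Indices whose value exceeds the threshold and both direct neighbors."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > threshold
            and signal[i] > signal[i - 1]
            and signal[i] > signal[i + 1]]

signal = [0.0, 0.2, 0.9, 0.3, 0.1, 0.4, 0.5, 0.2, 0.05, 0.1, 0.0]
peaks = local_maxima(signal, threshold=0.3)
```

The threshold suppresses small bumps such as the one at index 9, which is why smoothing and de-noising the raw measurement first matters for this kind of search.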

BreathOMICS

[Figure: matrix of patients versus substances]

Peak Alignment


Thank you for your Attention

Q & A

Contact:[email protected]