73
Sequence annotation. Cluster analysis. Meelis Kull [email protected] Lecture 5 Bioinformatics MTAT.03.239 University of Tartu 2011 Fall

Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Sequence annotation.Cluster analysis.

Meelis [email protected]

Lecture 5Bioinformatics MTAT.03.239

University of Tartu2011 Fall

Page 2: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Sequence Annotation:Highlighting information in

DNA, RNA, Proteins

Page 3: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

What enables information in biological sequences?

Aperiodicity

NaCl (table salt)

[sodium chloride] DNA

Periodic crystal -no information

Aperiodic crystal -information

Page 4: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

E. Schrödinger 1944:Aperiodic crystals encode genetic information

• Erwin Schrödinger“What is Life? The Physical Aspect of the Living Cell”1944

• In the book, Schrödinger introduced the idea of an "aperiodic crystal" that contained genetic information in its configuration of covalent chemical bonds

• Note that this is before discovery of the helical structure of DNA in 1953 by Watson & Crick

Page 5: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Reading the information

• How is the sequence information read?

• How is it converted into working machinery?

• Why is it important to have A instead of T in some locus?

• Different nucleotides in DNA/RNA and different amino acids in proteins have different physical and chemical properties

Page 6: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Chemical bindingExample: protein-DNA

Protein GCN4 binds on DNA at TGASTCA, (S is G or C)

Page 7: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Transcription regulation:Information processing

DNA

Page 8: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Sequence annotation

• How do the nucleotides in the genome contribute to the life of the cell?

• One of the main objectives in bioinformatics (and biology)

• Sequence annotation is the next major challenge for the Human Genome Project

• ENCyclopedia Of DNA Elements (ENCODE)

Page 9: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Genome (DNA) annotation: UCSC database

Proteome (protein) annotation: UniProt database

Sequence annotation

Page 10: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

How to get annotations for the genome?

• Evidence from case studies (Biology)

• Large scale experiments (Biology with Bioinformatics support)

• Predictions based on existing annotations (Bioinformatics)

• Experimental verification of predictions (Biology)

Page 11: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Finding TFBS - Transcription Factor Binding Sites

DNA

Page 12: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Transcription Factor binding on DNA

Protein GCN4 binds on DNA at TGASTCA, (S is G or C)

Page 13: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Substrings, k-mers, k-lets

• GCN4 prefers to bind on 7-mers TGACTCA and TGAGTCA

Terminology:

• 1-mer, monomer, nucleotide, base

• 2-mer, dimer, duplet, dinucleotide

• 3-mer, trimer, triplet, trinucleotide

• ...

• k-mer, k-let, substring/subsequence of length k

Page 14: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

IUPAC ambiguity codes

Page 15: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Transcription factor binding site examples

Page 16: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Models for aDNA sequence

• Consensus sequence

• k-mer

• IUPAC k-mer

• RegExp - Regular expression

• e.g. CG[GT]N{5,10}CCG

• PWM - Position Weight Matrix

• HMM - Hidden Markov Model

Page 17: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

PWM - Position Weight Matrix• TBP - TATA-binding protein

Binding TATAWAW where W is T or A

A [ 61 16 352 3 354 268 360 222 155 56 83 82 82 68 77 ]C [145 46 0 10 0 0 3 2 44 135 147 127 118 107 101 ]G [152 18 2 2 5 0 20 44 157 150 128 128 128 139 140 ]T [ 31 309 35 374 30 121 6 121 33 48 31 52 61 75 71 ]-----------------------------------------------------------------SUM 389 389 389 389 389 389 389 389 389 389 389 389 389 389 389

PWM logoCount matrix,

not PWM

Page 18: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Count matrix to PWM (1)

• Count matrixA [ 61 16 352 3 354 268 360 222 155 56 83 82 82 68 77 ]C [145 46 0 10 0 0 3 2 44 135 147 127 118 107 101 ]G [152 18 2 2 5 0 20 44 157 150 128 128 128 139 140 ]T [ 31 309 35 374 30 121 6 121 33 48 31 52 61 75 71 ]-----------------------------------------------------------------SUM 389 389 389 389 389 389 389 389 389 389 389 389 389 389 389

for any i = 1, 2, . . . , nS = cAi + cCi + cGi + cTi

cA1 cA2 . . . cAn

cC1 cC2 . . . cCn

cG1 cG2 . . . cGn

cT1 cT2 . . . cTn

Page 19: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Count matrix to PWM (2)

• Probability matrixA [ .16 .05 .87 .02 .88 .67 .89 .56 .39 .15 .22 .21 .21 .18 .20 ]C [ .37 .12 .01 .04 .01 .01 .02 .02 .12 .34 .37 .32 .30 .27 .26 ]G [ .38 .06 .02 .02 .02 .01 .06 .12 .40 .38 .33 .33 .33 .35 .35 ]T [ .09 .77 .10 .93 .09 .31 .03 .31 .09 .13 .09 .14 .16 .20 .19 ]-----------------------------------------------------------------SUM 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

for anyi = 1, 2, . . . , n

X = A,C,G,T

pA1 pA2 . . . pAn

pC1 pC2 . . . pCn

pG1 pG2 . . . pGn

pT1 pT2 . . . pTn

bAbCbGbT

pXi =cXi + bX

√S

S +√S

Page 20: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Count matrix to PWM (3)

• Position weight matrixA [ -0.6 -2.3 +1.8 -3.7 +1.8 +1.4 +1.8 +1.2 +0.6 -0.7 -0.2 -0.2 -0.2 -0.5 -0.3 ]C [ +0.6 -1.0 -4.4 -2.8 -4.4 -4.4 -3.7 -3.9 -1.1 +0.5 +0.6 +0.4 +0.3 +0.1 +0.1 ]G [ +0.6 -2.2 -3.9 -3.9 -3.4 -4.4 -2.0 -1.1 +0.7 +0.6 +0.4 +0.4 +0.4 +0.5 +0.5 ]T [ -1.5 +1.6 -1.4 +1.9 -1.5 +0.3 -3.2 +0.3 -1.4 -0.9 -1.5 -0.8 -0.6 -0.4 -0.4 ]

for anyi = 1, 2, . . . , n

X = A,C,G,T

wA1 wA2 . . . wAn

wC1 wC2 . . . wCn

wG1 wG2 . . . wGn

wT1 wT2 . . . wTn

wXi = log2pXi

bX

Page 21: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Matching a PWM on DNA

• Position weight matrix

A [ -0.6 -2.3 +1.8 -3.7 +1.8 +1.4 ]C [ +0.6 -1.0 -4.4 -2.8 -4.4 -4.4 ]G [ +0.6 -2.2 -3.9 -3.9 -3.4 -4.4 ]T [ -1.5 +1.6 -1.4 +1.9 -1.5 +0.3 ]

C C T T A T +0.6 -1.0 -1.4 +1.9 +1.8 +0.3 = +2.2

Sequence:Score:

PWM of length k gives a score to each k-mer in the DNA

Page 22: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Information content in PWM

• Information content (IC) - how different is a position in PWM from uniform distribution

ICi = log2 4−�

X=A,C,G,T

−pXi log2(pXi)

ICi

i

Page 23: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Observations: Sunny, Rainy, Sunny, Sunny, Rainy, Snowy, Snowy, ...Hidden states: Summer, Fall, Fall, Fall, Fall, Winter, Winter,...

HMMs - Hidden Markov Models

http://webdocs.cs.ualberta.ca/~colinc/cmput606/606FinalPres.ppt

Page 24: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Tasks with HMMsScoring task:

• Given an existing HMM and observed sequence, what is the probability that the HMM generates the sequence

Alignment task:

• Given a sequence, what is the optimal state sequence that the HMM would use to generate it

Training task:

• Given a large amount of data how can we estimate the structure and the parameters of the HMM that best accounts for the data

Page 25: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hidden Markov Models in Sequence Annotation

ACA---ATG TCAACTATCACAC--AGCAGA---ATCACCG--ATC

http://www.evl.uic.edu/shalini/coursework/hmm.ppt

ACANNNATC

ACA[ACGT]{0,3}ATC

Task: represent the following sequences in a model IUPAC consensus:

Regular expression:PWMHMM

Page 26: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

HMMs in gene finding• S - gene start

• D - donor splice site

• E - exon

• I - intron

• A - acceptor splice site

• T - gene termination

Page 27: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

HMMs in proteome annotation

HMM logo of hEGF domainfrom .

Page 28: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Common tasks in Genome annotation

• building a model for some DNA feature (or often just learning the parameters)

• searching for sites which match a model (or match it well enough)

• statistical analysis about what features are over-represented in some region of DNA

Page 29: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Common annotated features

• Genes, Transcripts, Exons, Introns

• TFBSs

• CpG islands

• Repeats

• and many more

Page 30: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Clustering

Page 31: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

What is Cluster Analysis?

Finding groups of objects such that the objects are:

• similar (or related) to the objects in the same group and

• different from (or unrelated) to the objects in other groups

Short distance

Long distance

Page 32: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Why to cluster biological data?

• Intuition building

• Hypothesis generation

• Summarizing / compressing large data

Page 33: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Partitional vs Hierarchical

Partitional clustering findsa fixed number of clusters

Hierarchical clustering creates a series of clusterings contained in each other

Page 34: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Fuzzy vs Non-Fuzzy

Each object belongs to eachcluster with some weight(the weight can be zero)

Each object belongs to exactly one cluster

Page 35: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering

Hierarchical clustering is usually depicted as a dendrogram (tree)

Page 36: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering

• Each subtree corresponds to a cluster• Height of branching shows distance

Page 37: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (0)

Algorithm for Agglomerative Hierarchical Clustering:Join the two closest objects

Page 38: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (1)

Join the two closest objects

Page 39: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (2)

Keep joining the closest pairs

Page 40: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (3)

Keep joining the closest pairs

Page 41: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (4)

Keep joining the closest pairs

Page 42: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (5)

Keep joining the closest pairs

Page 43: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (10)

After 10 steps we have 4 clusters left

Page 44: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (10)Several ways to measure distancebetween clusters:• Single linkage (MIN)

Page 45: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (10)Several ways to measure distancebetween clusters:• Single linkage (MIN) • Complete linkage (MAX)

Page 46: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (10)Several ways to measure distancebetween clusters:• Single linkage (MIN) • Complete linkage (MAX)• Average linkage• Weighted• Unweighted• ...

Page 47: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (11)

In this example and at this stage we have the same result as in partitional clustering

Page 48: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (12)

In the final step the two remaining clusters are joined into a single cluster

Page 49: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Hierarchical clustering (13)

In the final step the two remaining clusters are joined into a single cluster

Page 50: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Examples of Hierarchical Clustering in Bioinformatics

PhylogenyGene expression clustering

Page 51: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

• Partitional, non-fuzzy

• Partitions the data into K clusters

• K is given by the user

Algorithm:

• Choose K initial centers for the clusters

• Assign each object to its closest center

• Recalculate cluster centers

• Repeat until converges

K-means clustering

Page 52: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (1)

Page 53: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (2)

Page 54: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (3)

Page 55: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (4)

Page 56: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (5)

Page 57: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-means (6)

Page 58: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

• One of the fastest clustering algorithms

• Therefore very widely used

• Sensitive to the choice of initial centres

• many algorithms to choose initial centres cleverly

• Assumes that the mean can be calculated

• can be used on vector data

• cannot be used on sequences (what is the mean of A and T?)

K-means clustering

Page 59: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Distance measuresDistance of vectors and

• Euclidean distance

• Manhattan distance

• Correlation distance

Distance of sequences and

• Hamming distance => 3

• Levenshtein distance

x = (x1, . . . , xn) y = (y1, . . . , yn)

d(x, y) =

����n�

i=1

(xi − yi)2

d(x, y) =n�

i=1

|xi − yi|

d(x, y) = 1− r(x, y)is Pearson

correlation coefficientr(x, y)

ACCTTG TACCTGACCTTGTACCTG

.ACCTTGTACC.TG => 2

Page 60: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

• The same as K-means, except that the center is required to be at an object

• Medoid - an object which has minimal total distance to all other objects in its cluster

• Can be used on more complex data, with any distance measure

• Slower than K-means

K-medoids clustering

Page 61: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (1)

Page 62: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (2)

Page 63: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (3)

Page 64: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (4)

Page 65: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (5)

Page 66: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (6)

Page 67: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (7)

Page 68: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (8)

Page 69: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

K-medoids (9)

Page 70: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Examples of K-means and K-medoids in Bioinformatics

Gene expression clustering

Sequence clustering

Page 71: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Applications of sequence clustering• Clustering genes and proteins into families

• UniRef

• Pfam

• Clustering transcription factor binding motifs into families

• Jaspar

Page 72: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

MCL - Markov CLuster• A method for clustering a graph by flow simulation

• Based on random walks on graphs:a random walk infrequently goes from one cluster to another

• For example, used for clustering proteins by structural similarity

Page 73: Sequence annotation. Cluster analysis.courses.cs.ut.ee/2011/Bioinformatics/uploads/Main/annotation_and_clustering.pdf• 1-mer, monomer, nucleotide, base • 2-mer, dimer, duplet,

Summary of Clustering• Aims: intuition, hypothesis generation, summarization

• Types:

• Hierarchical/Partitional

• Fuzzy/Non-Fuzzy

• Vector-based/Distance-based

• etc.

• Distance measures

• Euclidean, Manhattan, Correlation

• Hamming, Levenshtein

• etc.

• Applications:

• Clustering genes, sequences, organisms, etc.