25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data...

25/05/2004

Evolution/Phylogeny/Pattern recognition

Bioinformatics Master Course

Bioinformatics Data Analysis and Tools

PatternsSome are easy some are not

• Knitting patterns

• Cooking recipes

• Pictures (dot plots)

• Colour patterns

• Maps

Example of algorithm reuse: Data clustering

• Many biological data analysis problems can be formulated as clustering problems– microarray gene expression data analysis– identification of regulatory binding sites (similarly, splice

junction sites, translation start sites, ......)– (yeast) two-hybrid data analysis (for inference of protein

complexes)– phylogenetic tree clustering (for inference of horizontally

transferred genes)– protein domain identification– identification of structural motifs– prediction reliability assessment of protein structures– NMR peak assignments – ......

Data Clustering Problems

• Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

• cluster identification -- identifying clusters with significantly different features than the background

Application Examples

• Regulatory binding site identification: CRP (CAP) binding site

• Two hybrid data analysis Gene expression data analysis

Are all solvable by the same algorithm!

Other Application Examples

• Phylogenetic tree clustering analysis (Evolutionary trees)

• Protein sidechain packing prediction

• Assessment of prediction reliability of protein structures

• Protein secondary structures

• Protein domain prediction

• NMR peak assignments

• ……

Multivariate statistics – Cluster analysis

Dendrogram

Scores

Similaritymatrix

C1 C2 C3 C4 C5 C6 ..

Raw table

Similarity criterion

Cluster criterion

Human Evolution

Comparing sequences - Similarity Score -

Many properties can be used:

• Nucleotide or amino acid composition

• Isoelectric point

• Molecular weight

• Morphological characters

• But: molecular evolution through sequence alignment

Multivariate statistics – Cluster analysisNow for sequences

Phylogenetic tree

Scores

Similaritymatrix

Multiple sequence alignment

Human -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ Chicken -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ Dogfish –KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQLamprey SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ Barley TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ Bacillus TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ Lacto__ste -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ Lacto_plant QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ Therma_mari MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ Bifido -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ Mycoplasma -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 1 Human 0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637 2 Chicken 0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651 3 Dogfish 0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655 4 Lamprey 0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669 5 Barley 0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575 6 Maizey 0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587 7 Lacto_casei 0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495 9 Lacto_plant 0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485 10 Therma_mari 0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598 11 Bifido 0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614 12 Thermus_aqua 0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641 13 Mycoplasma 0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Dendrogram/tree

Scores

Similaritymatrix

C1 C2 C3 C4 C5 C6 ..

Data table

Cluster criterion

Why do it?• Finding a true typology• Model fitting• Prediction based on groups• Hypothesis testing• Data exploration• Data reduction• Hypothesis generation But you can never prove a

classification/typology!

Cluster analysis – data normalisation/weighting

C1 C2 C3 C4 C5 C6 ..

Raw table

Normalisation criterion

C1 C2 C3 C4 C5 C6 ..

Normalised table

Column normalisation x/max

Column range normalise (x-min)/(max-min)

Cluster analysis – (dis)similarity matrix

Scores

Similaritymatrix

C1 C2 C3 C4 C5 C6 ..

Raw table

Di,j = (k | xik – xjk|r)1/r Minkowski metrics

r = 2 Euclidean distancer = 1 City block distance

Cluster analysis – Clustering criteria

Dendrogram (tree)

Scores

Similaritymatrix

Cluster criterion

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Neighbour joining – global measure

1. Start with N clusters of 1 object each

2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects

3. Most interesting clustering somewhere in between

Dendrogram (tree)

distance

N clusters1 cluster

Single linkage clustering (nearest neighbour)

Char 1

Char 2

Char 1

Char 2

Char 1

Char 2

Char 1

Char 2

Char 1

Char 2

Char 1

Char 2

Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster

Cluster analysis – Ward’s clustering criterion

Per cluster: calculate Error Sum of Squares (ESS)

ESS = x2 – (x)2/n

calculate minimum increase of ESS

Suppose:

Obj Val c l u s t e r i n g ESS

1 1 1 2 3 4 5 0

2 2 1 2 3 4 5 0.5

3 7 1 2 3 4 5 2.5

4 9 1 2 3 4 5 13.1

5 12 1 2 3 4 5 86.8

Phylogenetic tree

Scores

Similaritymatrix

C1 C2 C3 C4 C5 C6 ..

Data table

Cluster criterion

Scores

C1 C2 C3 C4 C5 C6

Cluster criterion

Scores

Cluster criterion

Make two-way ordered

table using dendrograms

Multivariate statistics – Principal Component Analysis (PCA)

C1 C2 C3 C4 C5 C6 Similarity Criterion:Correlations

Calculate eigenvectors with greatest eigenvalues:

•Linear combinations

•Orthogonal

Correlations

Project datapoints ontonew axes (eigenvectors)

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics

Evolution

• Most of bioinformatics is comparative biology

• Comparative biology is based upon evolutionary relationships between compared entities

• Evolutionary relationships are normally depicted in a phylogenetic tree

Where can phylogeny be used

• For example, finding out about orthology versus paralogy

• Predicting secondary structure of RNA

• Studying host-parasite relationships

• Mapping cell-bound receptors onto their binding ligands

• Multiple sequence alignment (e.g. Clustal)

Phylogenetic tree (unrooted)

mousefugu

Drosophila

internal node

OTU – Observed taxonomic unit

Phylogenetic tree (unrooted)

mousefugu

Drosophila

internal node

Phylogenetic tree (rooted)

fuguDrosophila

internal node (ancestor)

How to root a tree

• Outgroup – place root between distant sequence and rest group

• Midpoint – place root at midpoint of longest path (sum of branches between any two OTUs)

• Gene duplication – place root between paralogous gene copies

h D f m h

h- f- h- f- h-

Combinatoric explosion

# sequences # unrooted # rooted trees trees

2 1 13 1 34 3 155 15 1056 105 9457 945 10,3958 10,395 135,1359 135,135 2,027,02510 2,027,025 34,459,425

Tree distances

human x

mouse 6 x

fugu 7 3 x

Drosophila 14 10 9 x

Drosophila

6human

fuguDrosophila

Evolutionary (sequence distance) = sequence dissimilarity

Phylogeny methods

• Parsimony – fewest number of evolutionary events (mutations) – relatively often fails to reconstruct correct phylogeny

• Distance based – pairwise distances

• Maximum likelihood – L = Pr[Data|Tree]

Parsimony & DistanceSequences 1 2 3 4 5 6 7Drosophila t t a t t a a fugu a a t t t a a mouse a a a a a t a human a a a a a a t

human x

mouse 2 x

fugu 3 4 x

Drosophila 5 5 3 x

fuguDrosophila

Drosophila

parsimony

distance

Maximum likelihood

• If data=alignment, hypothesis = tree, and under a given evolutionary model,

maximum likelihood selects the hypothesis (tree) that maximises the observed data

• Extremely time consuming method

• We also can test the relative fit to the tree of different models (Huelsenbeck & Rannala, 1997)

Bayesian methods

• Calculates the posterior probability of a tree (Huelsenbeck et al., 2001) –- probability that tree is true tree given evolutionary model

• Most computer intensive technique• Feasible thanks to Markov chain Monte Carlo

(MCMC) numerical technique for integrating over probability distributions

• Gives confidence number (posterior probability) per node

Distance methods: fastest

• Clustering criterion using a distance matrix

• Distance matrix filled with alignment scores (sequence identity, alignment scores, E-values, etc.)

• Cluster criterion

Phylogenetic tree by Distance methods (Clustering)

Phylogenetic tree

Scores

Similaritymatrix

Multiple alignment

Human -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ Chicken -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ Dogfish –KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQLamprey SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ Barley TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ Maizey casei -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ Bacillus TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ Lacto__ste -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ Lacto_plant QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ Therma_mari MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ Bifido -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ Thermus_aqua MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ Mycoplasma -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Lactate dehydrogenase multiple alignment

Distance Matrix 1 2 3 4 5 6 7 8 9 10 11 12 13 1 Human 0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637 2 Chicken 0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651 3 Dogfish 0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655 4 Lamprey 0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669 5 Barley 0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575 6 Maizey 0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587 7 Lacto_casei 0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501 8 Bacillus_stea 0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495 9 Lacto_plant 0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485 10 Therma_mari 0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598 11 Bifido 0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614 12 Thermus_aqua 0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641 13 Mycoplasma 0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000

Cluster analysis – (dis)similarity matrix

Scores

Similaritymatrix

C1 C2 C3 C4 C5 C6 ..

Raw table

Di,j = (k | xik – xjk|r)1/r Minkowski metrics

r = 2 Euclidean distancer = 1 City block distance

Phylogenetic tree

Scores

Similaritymatrix

Cluster criterion

Single linkage - Nearest neighbour

Complete linkage – Furthest neighbour

Group averaging – UPGMA

Neighbour joining – global measure

Neighbour joining

• Global measure – keeps total branch length minimal, tends to produce a tree with minimal total branch length

• At each step, join two nodes such that distances are minimal (criterion of minimal evolution)

• Agglomerative algorithm

• Leads to unrooted tree

Neighbour joining

(a) (b) (c)

(d) (e) (f)

At each step all possible ‘neighbour joinings’ are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

How to assess confidence in tree

• Bayesian method – time consuming– The Bayesian posterior probabilities (BPP) are assigned to

internal branches in consensus tree – Bayesian Markov chain Monte Carlo (MCMC) analytical software

such as MrBayes (Huelsenbeck and Ronquist, 2001) and BAMBE (Simon and Larget,1998) is now commonly used

– Uses all the data

• Distance method – bootstrap:– Select multiple alignment columns with replacement– Recalculate tree– Compare branches with original (target) tree– Repeat 100-1000 times, so calculate 100-1000 different trees– How often is branching (point between 3 nodes) preserved for

each internal node?– Uses samples of the data

The Bootstrap

1 2 3 4 5 6 7 8 C C V K V I Y SM A V R L I F SM C L R L L F T

3 4 3 8 6 6 8 6 V K V S I I S IV R V S I I S IL R L T L L T L

Original

Scrambled

Non-supportive

25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data...

Documents

Bioinformatics Ali Day Two Sequence Analysisuser.it.uu.se/~kripe367/research3/AliHesham_files/Bioinformatics Ali... · 9/10/12 1 Next%Generation%BioinformaticsTools% Fall2012 Day%2%–Sequence%Analysisand%Phylogeny%

07/05/2004 Evolution/Phylogeny Introduction to Bioinformatics MNW2

SOLAPUR UNIVERSITY, SOLAPUR M.Sc. Part-I Bioinformatics …su.digitaluniversity.ac/WebFiles/M Sc I Bioinformatics.pdf · M.Sc. Part-I Bioinformatics Revised Syllabus (New CBCS Pattern

HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics

BMC Bioinformatics BioMed Central - COREBioMed Central Page 1 of 16 (page number not for citation purposes) BMC Bioinformatics ... constructed. A conserved pattern of electrostatics

GEPAS Gene Expression Pattern Analysis Suite Ka-Lok Ng Dept. of Bioinformatics Asia University

Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org

Combinatorial Pattern Matching - Phillip Compeaucompeau.cbd.cmu.edu/.../2016/08/Ch10_Clustering.pdf · Combinatorial Pattern Matching . An Introduction to Bioinformatics Algorithms

Phylogeny and Molecular Evolution - BGUbccg161/wiki.files/phylogeny...• Bioinformatics Algorithms book by Phillip Compeau and Pavel Pvzner –all book photos shown in this lecture

Bioinformatics & Hybrid-Seq - Plant Molecular Phylogeny Labamborella.net/2019-HybSeq-JeonHyungRin/Sungsin-Univ.2nd... · 2019-07-13 · •Bioinformatics is an interdisciplinary field

1-month Practical Course Genome Analysis Evolution and Phylogeny methods Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam

GriPPS Bioinformatics on Grid - LaBRI · GriPPS Bioinformatics on Grid Protein pattern scanning taken as model http:// gripps.ibcp.fr Contact: Christophe.Blanchet@ibcp.fr Protein

Copyright notice Molecular Phylogeny and Evolutionpevsnerlab.kennedykrieger.org/BCMB/2008-12-15_BCMB05...2008/12/15 · Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics

Multiple Sequence Alignment benchmarking, pattern recognition and Phylogeny

Multiple Sequence Alignment benchmarking, pattern recognition and Phylogeny Introduction to bioinformatics 2008 Lecture 11 C E N T R F O R I N T E G R

Michael Schroeder BioTechnological Center TU Dresden Biotec Sequence comparison and Phylogeny based on Chapter 4 Lesk, Introduction to Bioinformatics

Pattern Recognition Introduction to bioinformatics 2005 Lecture 4

Www..uni-rostock.de Ulf Schmitz, Pattern recognition - Clustering1 Bioinformatics Pattern recognition - Clustering Ulf Schmitz ulf.schmitz@informatik.uni-rostock.de

Www.bioalgorithms.infoAn Introduction to Bioinformatics Algorithms Combinatorial Pattern Matching

1 Molecular Phylogeny and Evolution Bioinformatics