108
Comparative Genomics EPIT 2014, Oléron, May 16th 2014 Eric Tannier, INRIA, LBBE, university of Lyon

Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Comparative Genomics

EPIT 2014, Oléron, May 16th 2014

Eric Tannier, INRIA, LBBE, university of Lyon

Page 2: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

What are the documents and the methods to write the history of life on earth (human and non human)?

Page 3: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

200 years

Oral (human) history

Page 4: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Written (human) history

6000 years

Page 5: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Archaeology

4 000 000 years

Page 6: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

PaleontologyComparative anatomy

1 000 000 000 years

Page 7: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Comparative Genomics

> 3 000 000 000 years

More than ¾ of the history on earth is made of microbes that have left no fossils. They represent 9/10 of today's biodiversity.Molecules are by far the most abundant source for this history.

Page 8: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Paleogenomics

Sequencing ancient DNA and proteins

200 year old viruses1000 year old bacteria700 000 year old mammals

Computational prediction of ancestral DNA or proteins

> 3 000 000 000 year old molecules

Page 9: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The principle is, as in comparative anatomy, to compare extant molecules and decipher what part of them is inherited from ancient ancestors, and what part is derivative.

For molecules we can integrate what we know about the processes of genome evolution.

Page 10: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Part 1 : how we learned that chromosomes are linear arrangements of linear genes (with classical rearrangement problems)

Part 2 : how we learned to forget it (with an open problem session)

Part 3 (if we have time) : research results- A 650 year old genome : Yersinia Pestis- how to date without fossils : chronology of cyanobacteria

Outline

Page 11: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Comparative Genomics starts with Mendel laws

Two individuals x1 and x

2, with « factors » (genes), and two values

per factor

x1

x2

x11

x12

x21

x22

(x11

+x12

)(x21

+x22

) = x11

x21

+ x11

x22

+ x12

x21

+ x12

x22

offsping has factors from (x11

x21

, x11

x22

, x12

x21

, x12

x22

) with probabilities (¼, ¼, ¼, ¼)

Page 12: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Values are ordered: the observed trait is taken from the highestExamples: for one trait with alleles a and A (a<A)

- Offspring from AA and AA parents are AA with probability 1(A + A)(A + A) = 4AAObservation A

- Offspring from aa and aa parents are aa with probability 12a*2a = 4aaObservation a

- Offspring from AA and aa parents are Aa with probability 1(A + A)(A + a) = 2AA + 2AaObservation A

- Offspring from Aa and Aa parents are (AA,Aa,aa) with prob (¼, ½, ¼)Or (A+a)2 = AA + 2Aa + aaObservation (A,b) with freq (¾, ¼)

Comparative Genomics starts with Mendel laws

Page 13: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

For two traits, probability distribution of factors is the independent combination of the two distributions

Example: for two independent traits with alleles a and A (a<A), b and B (b<B)

(A+a)2(B+b)2 = AABB + 2AABb + AAbb + 2AaBB + 4AaBb + 2Aabb + aaBB + 2aaBb + aabb

Observations (AB , Ab , aB , ab) freq (9/16, 3/16, 3/16, 1/16)

Comparative Genomics starts with Mendel laws

Page 14: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

2 factors, 2 values each(round > wrinkled)And(yellow > green)

Page 15: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Genetic linkage: Bateson and Punnet experiment (1900)

Traits

A genome is not a set of genes

Page 16: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Introduction of a "linkage probability": lp

Cross (A+a)2 and (B+b)2 with linkage coefficient lp = 

lp2 (AABB + 2AaBb + aabb) +(1­lp)2 (AAbb + 2AaBb + aaBB) +2*lp(1­lp) (AABb + AaBB + Aabb + aaBb)

Observed frequencies  AB : (lp2 + 2)/4  

Ab : (1­lp2)/4aB : (1­lp2)/4ab : lp2/4

Page 17: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

lp = ½ means independance, like in Mendel's law

In Bateson and Punnet's experiment, lp = 0.88

Page 18: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Sturtevant, 1913 (with Thomas Morgan)

The linkage probability lp is "close" to a linear distance d=1-lpIf a, b, c, are factors such that d(a,c) > d(a,b) and d(a,c) > d(b,c)

d(a,c) = d(a,b) + d(b,c)

Factors are organized linearly (mathematical prediction of the structure of chromosomes)

Page 19: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 20: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Bridges, Morgan, 1916 Castle, 1919

Page 21: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 22: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Sturtevant, 1921

Prediction of the inversion

Page 23: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Confirmation by observations

Sturtevant, 1921

Page 24: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Biology, thanks to mathematics, is a predictive science: like in physics, objects can be predicted before being discovered

(chromosomes, inversions)

Epistemological parenthesis

Page 25: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Molecular phylogeny, Dobzhanski and Sturtevant, 1938

Page 26: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Mathematical investigation of inversions, Sturtevant, Tan, 1937

Page 27: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

You don't need fast computers or abundant sequences to do computational biology and bioinformatics.

Epistemological parenthesis

Page 28: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The mathematical problem stayed univestigated from 1941 to 1982, and it was « solved » in 1999 by an NP-completeness result due to Caprara

Epistemological parenthesis

Page 29: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Summary at this point :- chromosomes are linear arrangement of genes- this arrangement can mutate- reconstructing the history of mutations is a complicated (and complex) problem

But genes are still abstract points on the line. Following the discovery of the structure of DNA, what is the structure of a gene ?

Page 30: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Now that the chromosomes are (more or less) linear, what about the genes?

Page 31: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 32: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 33: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Benzer,1959

Page 34: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 35: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Sometimes important objects are invented in the context of their application.

Proxy to decide if a problem is polynomial: ask a biologist to solve a small instance

Epistemological parenthesis

Page 36: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

In fact interval graphs were invented before by Norbert Wiener in 1914, but in the language of Whitehead and Russell, and the biological article of Benzer is more readable by any mathematician

Page 37: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

For posterity write in common language, it is more sustainable than an invented one

Epistemological parenthesis

Page 38: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Now that the genes have a structure, what about the inversion problem ?

It was NP-hard without a gene structure, it turns polynomial with the gene structure

Page 39: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Modelisation: Genes are segments, say they are arcs

Genes

Page 40: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Modelisation: chromosomes are linear arrangementsof genes, draw a matching (disjoint edges)

Genes

Genome 1

adjacencies

Page 41: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Modelisation: for another arrangement on the samegenes, draw another matching

Genes

Genome 2

Page 42: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Inversion

Modelisation: define an operation to transform a matching into another

Inversion: cut two edges, this makes four pending extremities, join them with two edges

Page 43: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Double cut-and-join (DCJ)

Fusion

Fission

Inversion

Reciprocal Translocation

Non reciprocal translocation

Page 44: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Optimization problem : given two matchings, give a minimum DCJ scenario that transforms one into the other

Solution given in 2005, following the solution on permutations in 1995

Page 45: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The comparison graph

Genes

Genome 1

Genome 2

Page 46: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The comparison graph

Genome 1

Genome 2

Page 47: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The comparison graph

Cycles, RedBlue paths, RedRed paths,BlueBlue paths

Page 48: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

d(Red,Blue) = n – (c + e/2)

Minimum number of DCJ

Number of genesNumber of cycles

Number of redblue paths

Page 49: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Example

D = n – (c+e/2) = 9 – (4+2/2) = 4

Page 50: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

An additional constraint from reality (the linearity of genes) can miraculously turn a problem easier

But this was an illusion, it was going further in the dead end

Epistemological parenthesis

Page 51: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

51

Useful generalizations

These complexity results (illustrated by the non optimal solution of 1937) explain why, though discovered 25 years before, rearrangements are not yet as considered as substitutions.

Ancestral state reconstruction (with >2 genomes), counting, sampling, integrating other events (duplications)

All remain hard?

Page 52: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

52

One solution : try your favorite solver or metaheuristic or MCMC

Badger, application on mitochondria Larget, Simon, 2004, 2005

Application on Yersinia, Darling, Miklos, Ragan 2008

Theoretical results on convergence Miklos, Tannier, 2010, 2012

Still only inversions (no duplications, transfers...)

Limited number of genomes (up to ~10) with limited number of genes (up to ~30)

Navigating in the space of phylogenies, ancestral configurations and scenarios

See also « combinatorics of genome rearrangements », Fertin et al, 2009

Page 53: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

53

Reverse the history (and forget the biology): keep genes linear but forget the inversions (and then chromosomes)

For this look at another (more successful) history

Epistemological parenthesis

Page 54: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

54

Since the 1960's, phylogenies are based on substitutions more than rearrangements

Zuckerkandl and Pauling, 1965

Page 55: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

55

The (wrong) hypothesis simplifying all computations:

Sites evolve independently from each other

Then you have a wonderful tool to reconstruct ancestral sequences: dynamic programming again.

Page 56: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

56

General problem. Given- a rooted phylogenetic tree- a state space, and values from this space on the leaves- a non negative cost function d(x,y) on the state spaceFind the states at the internal nodes, minimizing the sum of the cost over all branches

Ancestral character reconstruction

Page 57: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

57

Ancestral character reconstruction

Variants: find- the minimum cost solutions or- the most probable solution, or- the most probable solutions, or- sample among probable solutions, or- sample according to the probability distribution.

Page 58: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

Example: presence or absence of a trait

Page 59: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

E R R U U

Example: one amino acid site

Page 60: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

3.5 3.2 3.2 2.4 2.2

Example: genome size (in millions of base pairs)

Page 61: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

Fitch Method :Ascending phase

0

{0,1}

1

{0,1}

Page 62: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

0

1

1

1Fitch Method :Descending phase

Page 63: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

0

1

0

1

Fitch Method :Descending phase

Page 64: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

0

0

0

0

Fitch Method :Some optimal solutions are never examined

Page 65: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

A general scheme by dynamic programming

f(u,x) is the cost of assigning x to node u in the subtree rooted at u

If u is a leaf, f(u,x) = 0 if x is the value of u, ∞ otherwiseIf u is not a leaf

Sankoff, Rousseau, 1975

Page 66: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

x

y z

c(x,1) = min (c(y,1)+c(z,1),c(y,0)+c(z,1)+1,c(y,1)+c(z,0)+1,c(y,0)+c(z,0)+2)

c(x,0) = min (c(y,1)+c(z,1) +2,c(y,0)+c(z,1)+1,c(y,1)+c(z,0)+1,c(y,0)+c(z,0))

x

c(x,1) = 0 et c(x,0) = infinite if x=1c(x,0) = 0 et c(x,1) = infinite if x=0

For a binary state space

Page 67: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Not generalizable to permutations or matchings because of the state space (one cost computation for every state)

Ancestral character reconstruction is NP-hard for all variants.

?

Page 68: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

68

Today often genomes are compared uniquely with substitutions

It is not that substitutions are more important for evolution, but that computational tools are available

An important part of adaptation might be due to rearrangements but it is not reachable because of algorithmic complexity

Epistemological parenthesis

Page 69: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

69

The (wrong) hypothesis simplifying all computations:

Sites evolve independently from each other(that is forgetting the linearity of genes)

For rearrangements consider adjacencies as sites and decide they evolved one by one independently from each other

Page 70: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Single cut-and-join (SCJ)

Fusion (or circularising a path)

Fission (or cutting a cycle)

Page 71: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Optimization problem : given two matchings, give a minimum SCJ scenario that transforms one into the other

Page 72: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Now it is possible to- optimise (trivial)- count, enumerate (exercise, think about dynamic programming)

What is the loss of accurary ?

Page 73: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

73

Simulation of inversions

N = number of genes (=number of sites) = 100

Page 74: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

74

Simulation of inversions

N = number of genes (=number of sites) = 100

Page 75: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

75

Estimation of the number of events under a Jukes-Cantor model if sites evolve independentlyNd = number of different sitesN = size of the sequencet = time (actual number of events per site)

N = number of genes (=number of sites) = 100

Page 76: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

76

Simulation of inversions

N = number of genes (=number of sites) = 100

Page 77: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Now it is possible to- optimise, count, enumerate, sample....

With the same complexity as for amino acid or nucleotide sequencesWith a good accuracy

What about ancestral character reconstruction ?

Page 78: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

0 1 1 0 0

0

1

1

0

1 0 0 1 1

1

1

1

1

Presence of an edgeAB AC

Ancestral character reconstruction :Are the internal nodes matchings ?

Page 79: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Equivalent to: if an edge xy is present at ancestral node then all edges xt and yt are not

Proof (exercise): - by induction on the ascending phase- by contradiction on the descending phase

Again an illusion theorem. "Fitch" Solutions choosing 0 (absence) if there is an ambiguity at the root garanty that evolving each edge independently makes a matching at each ancestral node.

(Feijao and Meidanis, 2013)

Use of Fitch is bad for enumeration and counting in the ancestorsCounting SCJ scenarios is #P-hard (Miklos, Tannier, 2014)

Page 80: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Bridges, Morgan, 1916 Castle, 1919

Solution:Give up the linearity of chromosomes(after all it was not sure)

Page 81: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

81

But... more embarassing... we still assume that the genomes all have the same genes

Now it is possible to- optimise, count, enumerate, sample, compute ancestral states, sample among them

Ancestral chromosomes are not linear... Who cares ? If someone cares, linearize them with a maximum matching algorithm

Page 82: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

82

Genes evolve !!!

Page 83: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The effect of a duplication on adjacencies if chromosomes are linear:

- either it is a tandem duplication and each exemplar receives a neighbor

- or one copy is inserted elsewhere

Page 84: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The effect of a loss on adjacencies if chromosomes are linear:

Page 85: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The effect of a duplication on adjacencies if chromosomes are NOT linear:

- either it is a tandem duplication and each exemplar receives a neighbor

- or one copy is inserted elsewhere

Any combination of edge distribution is allowed (including none to one copy)

Page 86: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

The effect of a loss on adjacencies if chromosomes are NOT linear:

or

Page 87: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

human humanhuman human

Duplications

cattle cattlecattle cattle

One gene in the human/cattle ancestor Two genes in the human/cattle ancestor

The history of duplications for one gene can be drawn on a gene tree

Page 88: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Open problems (solution might be easy):

Given two genomes, transform one into the other with gains and losses of adjacencies, gains, duplications and losses of genes, with a minimum number of events.

Variants :- Genes can be vertices or arcs, labels can be repeated- adjacencies can form matchings or graphs- duplications can be in tandem or not, and duplications and loss can concern a connected subgraph of genes- history of duplications and losses of genes can be considered known or not

Generalization for ancestral character reconstruction

Page 89: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Almost all variants are open.

My conjecture is that all are NP-hard for linear chromosomes« Genomes Containing Duplicates Are Hard to Compare », Chauve et al, 2006

Theorem: one is polynomial if- chromosomes are not required to be linear- history of duplications are known (gene trees)- duplications are in tandem- losses remove all adjacencies

In this case ancestral reconstruction on a phylogenetic tree is polynomial

Bérard et al, DeCo, 2012

The conditions preserve the independance of adjacencies and allows dynamic programming

Page 90: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

90

Dynamic programming with propagation rules for adjacencies

Two genes

One duplicates

Adjacency is propagated to one copy

Page 91: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

91

Dynamic programming with propagation rules for adjacencies

Page 92: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

92

What time is it ?

If time << 12h30, then vote between- the recent story, yersinia pestis in the middle ages

- the old story, dating with horizontal gene transfers

ElseThanks for still being here, and have a nice lunch

Bioinformatics, 2013

PNAS, 2012

Page 93: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

93

Transfers

Page 94: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Reconstructing ancestral genomesIn the presence of transfers

Page 95: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Methanococcus jannashii

Aeropyrum pernix

Synechocystis sp.

Pyrococcus horikoshii Pyrococcus abyssi

Archaeoglobus fulgidus

Thermoplasma acidophilum

Halobacterium sp.

Streptomyces coelicolor

Mycobacterium tuberculosis

Deinococcus radiodurans

Bacillus subtilis

Mycoplasma pneumoniae Mycoplasma genitalium

Ureaplasma parvum

Helicobacter pylori Campylobacter jejuni

Xylella fastidiosa

Haemophilus influenzae Vibrio cholerae

Buchnera sp.

Pseudomonas aeruginosa

Neisseria meningitidis

Rickettsia prowazekii

Bacillus halodurans

Treponema pallidum Borrelia burgdorferi

Chlamydia trachomatis Chlamydia muridarum Chlamydophila pneumoniae

Thermotoga maritima

Aquifex aeolicus

Staphylococcus aureusLactococcus lactisStreptococcus pyogenes

Mycobacterium leprae

Pasteurella multocida

Caulobacter crescentus

Sulfolobus solfataricus

Arabidopsis thaliana

Methanobacterium thermoautotrophicum

0.5

Methanococcus jannashii

Aeropyrum pernix

Synechocystis sp.

Pyrococcus horikoshii Pyrococcus abyssi

Archaeoglobus fulgidus

Thermoplasma acidophilum

Halobacterium sp.

Streptomyces coelicolor

Mycobacterium tuberculosis

Deinococcus radiodurans

Bacillus subtilis

Mycoplasma pneumoniae Mycoplasma genitalium

Ureaplasma parvum

Helicobacter pylori

Xylella fastidiosa

Haemophilus influenzae Vibrio cholerae

Buchnera sp.

Pseudomonas aeruginosa

Neisseria meningitidis

Rickettsia prowazekii

Bacillus halodurans

Treponema pallidum Borrelia burgdorferi

Chlamydia trachomatis Chlamydia muridarum Chlamydophila pneumoniae

Thermotoga maritima

Aquifex aeolicus

Staphylococcus aureusLactococcus lactisStreptococcus pyogenes

Mycobacterium leprae

Pasteurella multocida

Caulobacter crescentus

Sulfolobus solfataricus

Arabidopsis thaliana

Methanobacterium thermoautotrophicum

Bacteria

Archaea

plantscyanobacteria

Arabidopsis thalianaChlamydomonas reinhardtii

Eukaryota

Campylobacter jejuni

Lateral gene transfer produces gene phylogenieswhich are far from species phylogenies

Page 96: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Two possible models for transfers

A transfered gene replaces an old gene

Identification: subtree prune and regraft, NP-Hard

A tranfered gene is added in the genome

Gene lineages are independant, reconciliation, polynomial

Page 97: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Transfers contain informations about dates when fossils are absent

Page 98: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

A dated cyanobacteria treeWith 473 gene families

Agrees with all known dated fossil records

PNAS, 2012

Page 99: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Épidémie médiévale qui a ravagé l'Europe au XIVe siècle

Trois grandes épidémies de pestes connues dans l'histoire- « justinienne », empire byzantin au VIIIe- peste noire, Europe occidentale 1348- Asie XIXe

Page 100: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Yersinia pestis

- Découverte au XIXe par Alexander Yersin, de l'institut Pasteur, pendant la troisième épidémie, initialement appelée Pastorella Pestis

- Première séquence chromosomique complète en 2001 (Parkhill et al, Nature, 2001)

- A évolué depuis un ancêtre commun (20000 ans) avec Yersinia pseudotuberculosis, a gagné un plasmide soupçonné d'être responsable de la pathogénicité chez l'humain

- Séquence pleine de répétition, évolution pleine de réarrangements (Chain et al, PNAS, 2004)

Page 101: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Premier séquençage de génome complet d'un bactérie pathogène ancienne

Page 102: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Séquençage Illumnia de 4 individus d'un cimetière londonien2366647 reads, taille ~55 bp, couverture 30X, capturé avec la

souche CO92Tentative d'assemblage (Velvet) à partir des souches CO92 et

Microtus : 2134 contigs, couvrant 4Mb

Page 103: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget
Page 104: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

A partir des 2134 contigs, une comparaison avec 11 génomes apparentés permet de reconstituer la séquence chromosomique

complète du génome ancien

1/ on découpe les contigs

On obtient 2619 familles de marqueurs, potentiellement dupliqués

Page 105: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

A partir des 2134 contigs, une comparaison avec 11 génomes apparentés permet de reconstituer la séquence chromosomique

complète du génome ancien

2/ on détecte des adjacences ancestrales potentielles entre marqueurs

Page 106: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

A partir des 2134 contigs, une comparaison avec 11 génomes apparentés permet de reconstituer la séquence chromosomique

complète du génome ancien

3/ on donne un ordre aux marqueurs en fonction des adjacences trouvées (problèmes avec les conflits et les répétitions)

Page 107: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

Une séquence hybride, 82 % séquencée, 18 % prédite

4/ on remplit les gaps entre les marqueurs par une estimation de la séquence ancestrale à partir d'alignements

Page 108: Comparative Genomicsigm.univ-mlv.fr/AlgoB/EPIT2014/08-Tannier.pdf · See also « combinatorics of genome rearrangements », Fertin et al, 2009. 53 Reverse the history (and forget

La séquence est distante de toutes les souches actuelles en termes de réarrangements

Certains nouveaux gènes apparaissent à la suite de ces réarrangements