Defining Gene Clusters: 24 Ways of Looking at Mount Fuji
Anne Bergeron, UQAM
Dublin, September 19, 2005
7. Mt Fuji from the Foot
"It struck me that it would be good to take one thing in life and regard it from many viewpoints, ... " Roger Zelazny
The basic problem
[Figure: genomes A, B and C]
We start with a set of genomes, labeled by gene names, domains, or synteny blocks, and a similarity relation on those labels.
Highlighting a gene means selecting all labels that are similar.
Genes, or other types of signals, can appear in multiple copies in a genome, or even be missing. In this talk, the similarity relation is "given" and is an equivalence relation.
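Highlighting can be sketched in a few lines of Python. This is only an illustration: it assumes the equivalence relation is supplied as a table mapping each label to the name of its class, and the table `family_of` and the gene labels are hypothetical.

```python
def highlight(genome, gene, family_of):
    """Return the positions in `genome` whose label is similar to `gene`.

    `family_of` maps every label to its equivalence class (an assumption:
    the talk only says that the similarity relation is given).
    """
    target = family_of[gene]
    return [i for i, label in enumerate(genome) if family_of[label] == target]


# Hypothetical labels: a1 and a2 are two copies of the same gene.
family_of = {"a1": "A", "a2": "A", "b": "B", "c": "C"}
genome_a = ["a1", "b", "c", "a2"]   # two copies of A
genome_b = ["c", "a2", "b"]         # one copy
genome_c = ["b", "c"]               # A is missing

for name, genome in [("A", genome_a), ("B", genome_b), ("C", genome_c)]:
    print(name, highlight(genome, "a1", family_of))
```

The three genomes show the cases mentioned above: multiple copies, a single copy, and a missing gene.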
[Figure: genomes A, B and C]
We are interested in what happens when a set of genes is highlighted.
A set of genes: { }
Boring...
[Figure: genomes A, B and C]
Another set of genes: { }
Interesting?
Measures of surprise are studied by Durand, Haque, Hoberman, Sankoff, Raghupathy, etc.
Goal: Given a (big) set of genomes, automatically identify all potentially interesting sets of genes.
1. Mount Fuji from Owari
Towards formal models
What do labels stand for?
How many labels and genomes do we want to compare?
What do we want to do with the resulting clusters?
Towards formal models: Example 1
From: Eichler and Sankoff, Science (301:793-797), 2003
Definition of labels and similarity: Large homology segments disrupted only by local micro-rearrangements.
A total of 281 synteny blocks, colored in the human genome by their mouse chromosome number.
Interesting features: Chromosome X, Chromosome 17, Chromosome 20
Application: Genome evolution
Towards formal models: Example 2
Definition of labels and similarity: Gene annotations of chloroplasts.
[Figure: chloroplast gene orders of Trachelium, Campanula, Adenophora, Symphyandra, Wahlenbergia and Merciera]
Interesting features: Rearrangements
Application: Phylogeny
Towards formal models: Example 3
From: Pasek et al, Genome Research (15:867-874), 2005
Definition of labels and similarity: PFAM Domain numbers labeling four bacterial genomes.
Interesting features: Duplications, Insertions, Rearrangements
Application: Operon identification
Towards formal models: Example 4
From: Pasek et al, Genome Research (15:867-874), 2005
Definition of labels and similarity: PFAM Domain numbers labeling four bacterial genomes.
Application: Identification of orthologs and/or duplicate segments.
With such a high E-value, the potential duplicate would have been missed by a comparison based on sequence similarity.
Towards formal models: Example 5
Definition of labels and similarity: Large homology segments disrupted only by local micro-rearrangements.
Comparing 16 segments of the mouse and rat chromosome X.
Application: Reconstructing ancestors
From: Bérard et al, WABI 2005
Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7
2. Mt Fuji from a Teahouse at Yoshida
Down to earth details
Do we allow gaps?
Do we allow rearrangements?
Do we allow duplicates and missing genes?
Do we allow multiple genomes or self-comparison?
How about "extensions"?
Down to earth details: Model 1
[Figure: genomes A, B and C]
A set of genes: { }
No gaps, no duplications, any rearrangement.
What about this gene? Should we add it?
[Figure: the same set with that gene added, labeled "Extension"]
Down to earth details: Model 2
[Figure: genomes A, B and C, with the genes not in the set marked]
A set of genes: { }
No gaps, duplications, any rearrangement.
Down to earth details: Model 3
[Figure: genomes A, B and C]
A set of genes: { }
Gaps, no duplications, any rearrangement.
Down to earth details: Model 4
[Figure: genomes A, B and C]
A set of genes: { }
Gaps, missing/inserted genes, any rearrangement.
Down to earth details: Model 5
[Figure: genomes A, B and C]
A set of genes: { }
Gaps, missing genes, duplications, any rearrangement.
With gap size = 1, we get 4 occurrences.
Reducing the number of genes...
A smaller set of genes: { }
... yields 5 occurrences.
24. Mount Fuji in a Summer Storm
A general framework
Given a gap g, an occurrence of S is a maximal run of genes of S, separated by gaps of at most g genes not in S, and that contains at least one of each gene of S.
A set S of genes: { }
A set of genes S is an extension of a set T, included in S, if each occurrence of T is contained in an occurrence of S.
S = { } is an extension of T = { }
[Figure: a chromosome with Occurrence #1 and Occurrence #2 of S; gaps inside an occurrence are ≤ g, gaps outside an occurrence are > g]
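The two definitions translate fairly directly into code. The sketch below is one possible reading, not the implementation behind the talk: a chromosome is a plain list of labels, labels stand for their own equivalence classes, and the function names and test data are hypothetical.

```python
def occurrences(chromosome, genes, g):
    """Inclusive (start, end) index pairs of the occurrences of `genes`.

    A run is cut whenever more than g genes outside the set separate two
    consecutive genes of the set; a run counts as an occurrence only if
    it contains at least one copy of every gene of the set.
    """
    genes = set(genes)
    runs, start, last = [], None, None
    for i, label in enumerate(chromosome):
        if label not in genes:
            continue
        if last is not None and i - last - 1 > g:
            runs.append((start, last))   # gap too large: close the run
            start = i
        elif start is None:
            start = i
        last = i
    if start is not None:
        runs.append((start, last))
    return [(s, e) for s, e in runs if genes <= set(chromosome[s:e + 1])]


def is_extension(chromosomes, s, t, g):
    """True if t is included in s and every occurrence of t lies inside an occurrence of s."""
    s, t = set(s), set(t)
    if not t <= s:
        return False
    for chrom in chromosomes:
        occ_s = occurrences(chrom, s, g)
        for ts, te in occurrences(chrom, t, g):
            if not any(ss <= ts and te <= se for ss, se in occ_s):
                return False
    return True


chrom = ["a", "b", "x", "c", "y", "y", "a", "c", "b"]
print(occurrences(chrom, {"a", "b", "c"}, g=1))
print(is_extension([chrom], {"a", "b", "c"}, {"a", "b"}, g=1))
```

With g = 1 the two y genes form a gap of size 2, so the set {a, b, c} has two separate occurrences, and each occurrence of {a, b} falls inside one of them.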
Choices
• g = 0 or g > 0
When g = 0, the number of candidates is polynomial in the number of genes.
When g > 0, the number of candidates can be exponential in the number of genes.
Even with g = 1, there are problems. For example, with g = 0, the sequence of genes:
a b c d e f
produces one potential cluster that contains both a and f. But with g = 1, there are 8 of them:
a b c d e f
a b c d f
a b c e f
a b d e f
a c d e f
a c e f
a b d f
a c d f
The number of these sequences grows in a Fibonacci progression!
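The Fibonacci growth can be checked by brute force. The sketch below enumerates, for a run of consecutive distinct genes, every subsequence that keeps the first and last gene and never skips more than g genes in a row; the function name is hypothetical and the run is assumed to contain at least two genes.

```python
from itertools import combinations


def gapped_candidates(genes, g):
    """Subsequences keeping the first and last gene, skipping at most g genes at a time."""
    n = len(genes)
    inner = range(1, n - 1)
    found = []
    for k in range(len(inner) + 1):
        for kept in combinations(inner, k):
            idx = (0, *kept, n - 1)
            # Every pair of consecutive kept genes must be at most g genes apart.
            if all(b - a - 1 <= g for a, b in zip(idx, idx[1:])):
                found.append("".join(genes[i] for i in idx))
    return found


print(len(gapped_candidates("abcdef", 0)))
print(sorted(gapped_candidates("abcdef", 1)))
print([len(gapped_candidates("abcdefgh"[:n], 1)) for n in range(2, 9)])
```

For g = 1 each step between kept genes has length 1 or 2, so the count for n genes is the number of ways to write n - 1 as an ordered sum of 1s and 2s, which is a Fibonacci number.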
• Duplications or no duplications
Duplications usually mean an exponential number of candidates but, most of the time, they are unavoidable.
Models without duplications are, nevertheless, useful in many situations.
• Three ways of filtering candidates
Filtering is mostly based on the properties of the extension relation.
If the number of candidates is low, filtering is not necessary, but it can be relevant.
For models with a huge number of candidates, filtering is a must.
• Formal or heuristic
Formal models have inherent computational problems when applied to real data.
Heuristics will always be useful.
Choices
• g = 0 or g > 0
• Duplications or no duplications
• Three ways of filtering candidates
• Formal or heuristic
2 x 2 x 3 x 2 = 24
How convenient!
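The 24 viewpoints are simply the Cartesian product of the four choices. In the sketch below, naming the three filtering options "no filtering", "filtering" and "heavy filtering" is an assumption taken from the labels used on the model slides of this talk.

```python
from itertools import product

choices = {
    "gap": ("g = 0", "g > 0"),
    "duplications": ("no duplications", "duplications"),
    # Assumed names for the three ways of filtering.
    "filtering": ("no filtering", "filtering", "heavy filtering"),
    "approach": ("formal", "heuristic"),
}

models = [dict(zip(choices, combo)) for combo in product(*choices.values())]
print(len(models))
print(models[0])
```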
20. Mount Fuji from Inume Pass
Common intervals: Voluntary simplicity*
*Voluntary simplicity is a lifestyle considered by its adherents to be a sustainable, ecologically sensitive alternative to the typical, western consumerist lifestyle. [Ref. Wikipedia]
A (partial) list of credits:
Uno and Yagiura (2000)
Heber and Stoye (2001)
Bergeron, Heber and Stoye (2002)
Didier (2003)
Schmidt and Stoye (2004)
Figeac and Varré (2004)
Bérard, Bergeron and Chauve (2004)
Blin, Chauve and Fertin (2005)
Landau, Parida and Weizman (2005)
Tannier and Sagot (2005)
Bérard, Bergeron, Chauve and Paul (2005)
Bergeron, Chauve, de Montgolfier and Raffinot (2005)
Common intervals
Choices
• g = 0
• No duplications
• No filtering
• Formal
[Figure: genomes A, B and C]
The basic model of common intervals often yields a large number of 'uninteresting clusters'. However, filtering provides unusual information on whole genome organization.
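For two genomes without duplications, viewed as permutations of the same genes, a common interval is a set of genes that is contiguous in both (g = 0). The sketch below is a naive quadratic enumeration, not one of the optimized algorithms from the credited papers; the function name and the small example are hypothetical, and singletons are left out.

```python
def common_intervals(perm_a, perm_b):
    """Common intervals (size >= 2) of two permutations of the same genes."""
    pos_in_a = {gene: i for i, gene in enumerate(perm_a)}
    n = len(perm_b)
    result = []
    for i in range(n):
        lo = hi = pos_in_a[perm_b[i]]
        for j in range(i + 1, n):
            p = pos_in_a[perm_b[j]]
            lo, hi = min(lo, p), max(hi, p)
            # perm_b[i..j] is contiguous in perm_a exactly when its positions
            # there span j - i + 1 consecutive slots.
            if hi - lo == j - i:
                result.append(frozenset(perm_b[i:j + 1]))
    return result


for interval in common_intervals([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]):
    print(sorted(interval))
```

Even on five genes the output already contains overlapping sets such as {1, 2, 3} and {3, 4, 5}, which is the kind of redundancy that motivates filtering.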
Common intervals -> Strong Intervals
Choices
• g = 0
• No duplications
• Filtering
• Formal
[Figure: genomes A and B]
[Figure: the common intervals s, t, u and v]
Both t and u are two different extensions of the common interval s: remove them.
[Figure: the remaining strong intervals, s and v]
Strong Intervals
From: Bérard et al, WABI 2005
Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7
This tree displays the strong intervals between the synteny blocks of the mouse and rat chromosomes X.
This kind of tree is known as a PQ-tree. Strong intervals possess a rich combinatorial structure that can be exploited both from the biological and computational perspectives.
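The internal nodes of the tree can be recomputed from the two block orders. The sketch below uses the usual definition, under which a common interval is strong when it overlaps no other common interval (two sets overlap when they intersect and neither contains the other). It ignores signs and singletons, and the naive helper `common_intervals` is an illustrative stand-in for the efficient algorithms in the literature.

```python
def common_intervals(perm_a, perm_b):
    """Naive enumeration of the common intervals (size >= 2) of two permutations."""
    pos = {gene: i for i, gene in enumerate(perm_a)}
    found = set()
    for i in range(len(perm_b)):
        lo = hi = pos[perm_b[i]]
        for j in range(i + 1, len(perm_b)):
            p = pos[perm_b[j]]
            lo, hi = min(lo, p), max(hi, p)
            if hi - lo == j - i:
                found.add(frozenset(perm_b[i:j + 1]))
    return found


def strong_intervals(intervals):
    """Keep the intervals that overlap no other interval of the collection."""
    def overlaps(x, y):
        return bool(x & y) and not (x <= y or y <= x)
    return {x for x in intervals if not any(overlaps(x, y) for y in intervals)}


mouse = list(range(1, 17))
rat = [4, 3, 2, 1, 13, 15, 14, 16, 8, 9, 10, 11, 12, 5, 6, 7]  # signs dropped

common = common_intervals(mouse, rat)
strong = strong_intervals(common)
print(len(common), len(strong))
for s in sorted(strong, key=lambda s: (-len(s), min(s))):
    print(sorted(s))
```

On this example the filter keeps 7 of the 27 common intervals with at least two blocks.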
[Figure: tree of the strong intervals. Root: 4 3 2 1 13 15 14 16 8 9 10 11 12 5 6 7. Internal nodes: 4 3 2 1; 13 15 14 16 8 9 10 11 12 5 6 7; 13 15 14 16; 15 14; 8 9 10 11 12; 5 6 7. Leaves: the 16 synteny blocks.]
Strong Intervals : transforming a rat into a mouse
This tree provides guidelines to possible rearrangement scenarios that transform the rat chromosome into a mouse chromosome. These scenarios preserve all common intervals.
Intervals are first labeled (in red) with respect to their relative orientation.
Then all strong intervals that disagree with their parent are inverted, one after the other. The block order after a step is shown where it changes (signs are not shown):
Invert 1
Invert 4 3 2 1: 1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
Invert 13
Invert 14
Invert 16
Invert 14 15: 1 2 3 4 13 14 15 16 8 9 10 11 12 5 6 7
Invert 13 14 15 16: 1 2 3 4 16 15 14 13 8 9 10 11 12 5 6 7
Invert 11
Invert 8 9 10 11 12: 1 2 3 4 16 15 14 13 12 11 10 9 8 5 6 7
Invert 5 6 7: 1 2 3 4 16 15 14 13 12 11 10 9 8 7 6 5
Invert 5 6 7 ... 14 15 16: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
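The scenario can be replayed on the signed rat order given earlier in the talk. The sketch below does not compute the scenario from the tree; it only applies the inversions named on these slides, in the same order, and checks the result. An inversion reverses a contiguous segment and flips the sign of each block in it; the helper name `invert` is hypothetical.

```python
def invert(perm, blocks):
    """Reverse the contiguous segment holding `blocks` and flip its signs."""
    blocks = set(blocks)
    idx = [i for i, x in enumerate(perm) if abs(x) in blocks]
    start, end = idx[0], idx[-1]
    assert end - start + 1 == len(blocks), "blocks must be contiguous"
    return perm[:start] + [-x for x in reversed(perm[start:end + 1])] + perm[end + 1:]


rat = [-4, -3, -2, 1, -13, -15, 14, -16, 8, 9, 10, -11, 12, 5, 6, 7]

# Inverted strong intervals, in the order given on the slides.
steps = [
    [1], [1, 2, 3, 4], [13], [14], [16], [14, 15], [13, 14, 15, 16],
    [11], [8, 9, 10, 11, 12], [5, 6, 7], list(range(5, 17)),
]

state = rat
for blocks in steps:
    state = invert(state, blocks)
    print(state)
```

After the eleven inversions every block is positive and in increasing order, which is the mouse chromosome.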
18. Mt Fuji from the Offing in Kanagawa
Domain Teams: The 'eXtreme' model
A (partial) list of credits:
Bergeron, Corteel and Raffinot (2002)
Luc, Risler, Bergeron and Raffinot (2003)
He and Goldwasser (2004)
Béal, Bergeron, Corteel and Raffinot (2004)
Pasek, Bergeron, Risler, Louis, Ollivier and Raffinot (2005)
Blin, Chauve and Fertin (2005)
Domain Teams
Choices
• g > 0
• Duplications
• Heavy filtering
• Formal
[Figure: genomes A and B. Each candidate team that has an extension is marked "has an extension".]
Remove them all!
[Figure: the surviving teams]
Domain Teams : Example
67591 domains
50078 proteins
16 chromosomes
Maximum gap: 3
16713 domain teams
From: Pasek et al, Genome Research (15:867-874), 2005
The combinatorial beauty of nature
12. Mt Fuji from Lake Kawaguchi
Does nature allow all possible rearrangements?
Six domains can theoretically form 63 potential teams. If they are labelled as {a, b, c, d, e, f}, the possible teams with more than one member are:
{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {b, c}...
{a, b, c}, {a, b, d}, {a, b, e}, ...
...
{a, b, c, d, e, f}
For 6 domains, of the 63 possibilities, we found 35 teams that had at least two occurrences and no extension.
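The figure of 63 is the number of non-empty subsets of six domains, 2^6 - 1. A quick enumeration, given here only as an illustrative check, confirms it and shows how many of those subsets have more than one member.

```python
from itertools import combinations

domains = "abcdef"

# Every non-empty subset of the six domains.
teams = [set(c) for k in range(1, len(domains) + 1) for c in combinations(domains, k)]
multi_member = [t for t in teams if len(t) > 1]

print(len(teams), 2 ** len(domains) - 1)
print(len(multi_member))
```

The 63 subsets include the 6 singletons, so 57 of them have at least two members.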
Promiscuous domains
Who are they?
PF00005 ABC transporter
PF00072 Response regulator receiver domain
PF00486 Transcriptional regulatory protein
PF00512 His Kinase A
PF00528 Binding-protein-dependent transport system inner membrane
PF00672 HAMP domain
The need for heuristics
21. Mount Fuji from the Totomi Mountains
Choices
• g > 0
• Duplications
• No filtering
• Heuristic
From: St-Onge, et al. Poster RECOMB CG 2005
Very reasonable approximations of the general model can be obtained efficiently -- a few minutes -- in the case of very large scale comparisons.
An uncertainty principle
With the general model of gene clusters, it is impossible to predict simultaneously the computing time AND the properties of the output.
Credits
Marie-Pierre Béal, Informatique, Marne-la-Vallée
Sèverine Bérard, INRA, Toulouse
Mathieu Blanchette, McGill University
Sylvie Corteel, PRiSM, Versailles
Steffen Heber, Raleigh, USA
Hokusai Katsushika: 1760-1849
Nicolas Luc, Génome et informatique, Evry
Fabien de Montgolfier, LIAFA, Paris
Christophe Paul, LIRMM, Montpellier
Sophie Pasek, Génome et informatique, Evry
Jean-Loup Risler, Génome et informatique, Evry
Mathieu Raffinot, Laboratoire Poncelet, Moscow
Jens Stoye, Technische Fakultät, Bielefeld
Cedric Chauve
Annie Chateau
Olivier Gingras
Yannick Gingras
André Levasseur
Jacqueline Rwirangira
Karine St-Onge