
Predicting interactions between genes based on genome sequence comparisons

The “genomic context” component of STRING

Bioinformatics seminar series, 15-11-2005

Berend Snel

Today

• Announcement: the seminar of Jakob de Vlieg on 22 November is canceled. Please consult the website of the seminar series (www.cmbi.ru.nl/edu/seminars) for the new date.

• Seminar (today); please ask questions!

• Handing out article and questions: “Identification of a bacterial regulatory system for ribonucleotide reductases by phylogenetic profiling.” Read the article and hand in the answers to the questions by Monday, November 28th.

Contents

• Predicting functional interactions between proteins; what & why
• Genomic context methods
  – General
  – Gene fusion
  – Gene order
  – Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Complete genomes, now what?

• Post-genomic era = we have the parts list (complete genomes)

• To understand the cell we need to know the functions of the genes

A bacterial genome

     gene            408..1748
                     /gene="dnaA"
                     /locus_tag="BCE33L0001"
                     /old_locus_tag="BCZK0001"
     CDS             408..1748
                     /gene="dnaA"
                     /locus_tag="BCE33L0001"
                     /old_locus_tag="BCZK0001"
                     /inference="non-experimental evidence, no additional details recorded"
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein"
                     /protein_id="AAU20227.1"
                     /db_xref="GI:51978677"
                     /translation="MENISDLWNSALKELEKKVSKPSYETWLKSTTAHNLKKDVLTIT
                     APNEFARDWLESHYSELISETLYDLTGAKLAIRFIIPQSQAEEEIDLPPAKPNAAQDD
                     SNHLPQSMLNPKYTFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHL
                     MHAIGHYVIEHNPNAKVVYLSSEKFTNEFINSIRDNKAVDFRNKYRNVDVLLIDDIQF
                     LAGKEQTQEEFFHTFNALHEESKQIVISSDRPPKEIPTLEDRLRSRFEWGLITDITPP
                     DLETRIAILRKKAKAEGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLINKDIN
                     ADLAAEALKDIIPNSKPKIISIYDIQKAVGDVYQVKLEDFKAKKRTKSVAFPRQIAMY
                     LSRELTDSSLPKIGEEFGGRDHTTVIHAHEKISKLLKTDTQLQKQVEEINDILK"
     gene            1927..3066
                     /gene="dnaN"
                     /locus_tag="BCE33L0002"
                     /old_locus_tag="BCZK0002"
     CDS             1927..3066
                     /gene="dnaN"
                     /locus_tag="BCE33L0002"
                     /old_locus_tag="BCZK0002"
                     /EC_number="2.7.7.7"
                     /inference="non-experimental evidence, no additional details recorded"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA polymerase III, beta subunit"
                     /protein_id="AAU20226.1"
                     /db_xref="GI:51978676"
                     /translation="MRFTIQKDYLVRSVQDVMKAVSSRTTIPILTGIKVVATEEGVTL
                     TGSDADISIESFIPVEEDGKEIVEVKQSGSIVLQAKYFSEIVKKLPKETVEISVENHL
                     MTKITSGKSEFNLNGLDSAEYPLLPQIEEHHVFKIPTDLLKHMIRQTVFAVSTSETRP
                     ILTGVNWKVYNSELTCIATDSHRLALRKAKIEGIADEFQANVVIPGKSLNELSKILDE
                     SEEMVDIVITEYQVLFRTKHLLFFSRLLEGNYPDTTRLIPAESKTDIFVNTKEFLQAI
                     DRASLLARDGRNNVVKLSTLEQAMLEISSNSPEIGKVVEEVQCEKVDGEELKISFSAK
                     YMMDALKALDSTEIKISFTGAMRPFLIRTVNDESIIQLILPVRTY"

For most genes in any genome we need function prediction

- E. coli, the most intensively studied organism: only 1924 genes (~43%) have been (partially) experimentally characterized.

What is function?

Various levels of description:

Sequence similarity/homology has the largest relevance for “Molecular Function”. This aspect of protein function is best conserved. Molecular function can often be predicted from similarities between protein sequences (BLAST), or structures.

Predicting protein function

Homology: BLAST and / or SMART/PFAM/CDD

gi|22209068| Mayven [Homo sapiens] 1159
gi|21410410| Klhl2 protein [Mus musculus] 1145
. . .
. . .
gi|55725960| hypothetical protein [Pongo pygmaeus] 887
gi|6644176| Klhl3 [Homo sapiens] 885
gi|19354513| Klhl3 protein [Mus musculus] 765
gi|12644384| Ring canal kelch protein [Drosophila melanogaster] 676

“Beyond” homology and molecular function

Homology-based function prediction works very well, yet:

• A large fraction of genes are poorly described (no homologs, or only uncharacterized homologs; this holds for ~60% of the human genes)

• There are other aspects of function: functional associations, e.g. the target of a protein kinase or a transcriptional regulator. That is, to understand the cell we need to know the interactions of the genes

Thus: predicting associations

There are many types of functional associations (AKA functional interactions, interactions, functional links, functional relations) in molecular biology

Types of functional associations

• Metabolic pathways: filling gaps
• Transcription regulation
• Signalling pathways
• Cellular process (“DNA repair”, “Apoptosis”)
• Protein complexes

So how can knowledge of the functional associations help?

• If we did not know anything about the function of the protein, we can now say in which process it is involved

• If we already knew something about the function, we might now know much more about it (e.g. if we knew it was a hydrolase, we might now know in which metabolic pathway it is active)

• If the gene was already well characterized, we might understand its role better (e.g. new targets for a kinase)


How can we now predict / detect functional associations?

• Functional genomics / high throughput experiments

• GENOMIC CONTEXT

Functionally associated proteins leave evolutionary traces of their relation in genomes

We can thus detect evolutionary traces of a functional association by comparing genomes

• Use the genome sequences themselves (through comparative genome analysis) for interaction prediction: genomic context methods

Genomic context is a tool to predict functional associations between genes

[Figure: fraction of predicted pairs on the same KEGG map (y-axis, 0 to 1) versus score (x-axis, 0 to 1), for Fusion, Gene Order and Co-occurrence]

• Genomic context methods have been shown to be reliable indicators for functional interaction

• Genomic context is also known as in silico interaction prediction, or genomic associations

http://string.embl.de

Three different genomic context methods in STRING

• Gene fusion, Rosetta stone method
• Conserved gene order between divergent genomes
• Co-occurrence of genes across genomes, phylogenetic profiles


Gene fusion

• i.e. the orthologs of two genes in another organism are fused into one polypeptide

• A very reliable indicator for functional interaction, partly because it is a relatively infrequent evolutionary event: 3470 distinct fusions when surveying 179 genomes
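The fusion idea can be illustrated with a small sketch. It assumes every protein has already been assigned to one or more orthologous groups; the data layout, the group names and the function name `find_fusion_links` are hypothetical and are not taken from STRING itself.

```python
from collections import defaultdict
from itertools import combinations


def find_fusion_links(proteins):
    """Return {(group_a, group_b): [genomes with a fused protein]}.

    `proteins` is a list of (genome, protein_id, groups) tuples, where
    `groups` lists the orthologous groups found in that one polypeptide.
    A protein that covers two groups is treated as a fusion of both.
    """
    links = defaultdict(list)
    for genome, _protein_id, groups in proteins:
        for a, b in combinations(sorted(set(groups)), 2):
            links[(a, b)].append(genome)
    return dict(links)


# Hypothetical toy data: in genome G3 the orthologs of groups X and Y
# are fused into one polypeptide; elsewhere they are separate genes.
toy_proteins = [
    ("G1", "p1", ["X"]),
    ("G1", "p2", ["Y"]),
    ("G2", "p3", ["X"]),
    ("G3", "p4", ["X", "Y"]),
    ("G3", "p5", ["Z"]),
]

fusion_links = find_fusion_links(toy_proteins)
print(fusion_links)
```

In this toy input only the pair (X, Y) is linked, with G3 as the supporting genome. The separate X and Y genes in G1 and G2 would then be predicted to be functionally associated.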

Gene fusion: an example


Gene order evolves rapidly

But …

Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association

“Operons”

Comparison to pathways: conservation implies a functional association

[Figure: number of COGs (left axis, 1 to 10000) and average metabolic distance (right axis, 0 to 6) versus co-occurrences in operons (x-axis, 0 to 30)]

Conserved gene order

• i.e. genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster

• Contributes by far the most predictions

NB1: predicting operons is not trivial; in fact conserved gene order or functional association is a major clue

NB2: using ‘only’ operons without requiring conservation results in much less reliable function prediction
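A minimal sketch of the conservation requirement: count in how many genomes two orthologous groups are direct neighbours on the same strand, and keep only the pairs seen in several genomes. The input layout, the `min_genomes` cutoff and the function name are illustrative assumptions. This toy version counts every genome equally, whereas a realistic score would also have to weigh how divergent the genomes are (the ‘sufficiently large’ evolutionary distance above).

```python
from collections import Counter


def conserved_neighbours(genomes, min_genomes=2):
    """Return {(group_a, group_b): number of genomes} for conserved pairs.

    `genomes` maps a genome name to its genes in chromosomal order,
    each gene given as (orthologous_group, strand).
    """
    counts = Counter()
    for genes in genomes.values():
        seen = set()
        for (g1, s1), (g2, s2) in zip(genes, genes[1:]):
            # Neighbours on the same strand are candidate operon partners.
            if s1 == s2 and g1 != g2:
                seen.add(tuple(sorted((g1, g2))))
        counts.update(seen)  # each genome counts once per pair
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical toy genomes: A and B stay together, other pairs do not.
toy_genomes = {
    "G1": [("A", "+"), ("B", "+"), ("C", "-"), ("D", "-")],
    "G2": [("D", "+"), ("B", "-"), ("A", "-"), ("C", "+")],
    "G3": [("A", "+"), ("B", "+"), ("D", "+"), ("C", "-")],
}

conserved = conserved_neighbours(toy_genomes)
print(conserved)
```

With `min_genomes=1` the same function reports every same-strand neighbour pair in any single genome, which corresponds to the less reliable ‘operons only’ case of NB2.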

Conserved gene order: an example from metabolism of propionyl-CoA

[Figure labels: “query”, “target”]

Biochemical assays confirm the function of members of COG0346 as a DL-methylmalonyl-CoA racemase


Presence / absence of genes

Gene content co-evolution. (The easy case, few genomes.)

Genomes share genes for phenotypes they have in common

Differences between gene content reflect differences in phenotypic potentialities

Presence / absence of genes

L. innocua (non-pathogenic) versus L. monocytogenes (pathogenic)

Genes involved in pathogenicity

Generalization: phylogenetic profiles / co-occurrence

          species 1  species 2  species 3  species 4  species 5  ...
Gene 1:       1          0          1          1          0       1
Gene 2:       1          1          0          0          1       0
Gene 3:       0          1          0          0          1       0
....

Co-occurrence of genes across genomes

• i.e. two genes have the same presence / absence pattern over multiple genomes: they have ‘co-evolved’

• AKA phylogenetic profiles
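As a sketch of how such profiles can be compared, the three example profiles from the slide (Gene 1: 1 0 1 1 0 1, Gene 2: 1 1 0 0 1 0, Gene 3: 0 1 0 0 1 0) are scored pairwise below. Using the Hamming distance (number of genomes in which the two genes differ) is an assumption made for illustration; the slides only require that the patterns be the same.

```python
from itertools import combinations

# Presence (1) / absence (0) per genome, taken from the example matrix.
profiles = {
    "Gene 1": [1, 0, 1, 1, 0, 1],
    "Gene 2": [1, 1, 0, 0, 1, 0],
    "Gene 3": [0, 1, 0, 0, 1, 0],
}


def hamming_distance(p, q):
    """Number of genomes in which exactly one of the two genes is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def profile_distances(profiles):
    """Distance for every unordered pair of genes."""
    return {
        (a, b): hamming_distance(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }


distances = profile_distances(profiles)
closest_pair = min(distances, key=distances.get)
print(distances)
print(closest_pair)
```

Gene 2 and Gene 3 differ in only one genome, so they are the best candidates for a co-occurrence link. Gene 1 and Gene 3 differ in all six genomes.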

Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes

• Friedreich’s ataxia
• No (homolog with) known function

[Figure: gene presence / absence across species. Species: A.aeolicus, Synechocystis, B.subtilis, M.genitalium, M.tuberculosis, D.radiodurans, R.prowazekii, C.crescentus, M.loti, N.meningitidis, X.fastidiosa, P.aeruginosa, Buchnera, V.cholerae, H.influenzae, P.multocida, E.coli, A.pernix, M.jannaschii, A.thaliana, S.cerevisiae, C.jejuni, C.albicans, S.pombe, H.sapiens, C.elegans, H. pylori, D.melan. Gene labels: cyaY Yfh1; hscB Jac1; hscA; ssq1; Nfu1; iscA Isa1-2; fdx Yah1; Arh1; RnaM; IscR; Hyp; iscS Nfs1; iscU Isu1-2; Atm1]

Frataxin has co-evolved with hscA and hscB, indicating that it plays a role in iron-sulfur cluster assembly

[Figure: Iron-Sulfur (2Fe-2S) cluster in the Rieske protein; panels labelled “Prediction” and “Confirmation”]

The opposite of co-occurrence: anti-correlation / complementary patterns, predicting analogous enzymes

Genes with complementary phylogenetic profiles tend to have a similar biochemical function.
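Complementarity can be quantified in the same presence / absence representation. The sketch below tallies, for two hypothetical genes A and B, the genomes that contain only one of them, both, or neither. The scoring function and the toy profiles are assumptions for illustration, not the method of the cited study.

```python
def pattern_counts(p, q):
    """Tally the four presence / absence combinations of two profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    counts = {"only_a": 0, "only_b": 0, "both": 0, "neither": 0}
    for a, b in zip(p, q):
        if a and b:
            counts["both"] += 1
        elif a:
            counts["only_a"] += 1
        elif b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


def anticorrelation_score(p, q):
    """Fraction of genomes that contain exactly one of the two genes."""
    counts = pattern_counts(p, q)
    return (counts["only_a"] + counts["only_b"]) / len(p)


# Hypothetical profiles: every genome has A or B, never both.
gene_a = [1, 0, 1, 1, 0, 1]
gene_b = [0, 1, 0, 0, 1, 0]

counts = pattern_counts(gene_a, gene_b)
score = anticorrelation_score(gene_a, gene_b)
print(counts, score)
```

A score close to 1, with few genomes in the "both" and "neither" classes, is the pattern expected when two unrelated genes can fill the same biochemical role.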

Complementary patterns in thiamin biosynthesis predict analogous enzymes

Morett E, Korbel JO, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt S, Snel B, Bork P. Nature Biotech 2003

Prediction of analogous enzymes is confirmed


Benchmark and integration: KEGG maps

[Figure: fraction of predicted pairs on the same KEGG map (y-axis, 0 to 1) versus score (x-axis, 0 to 1), for Fusion, Gene Order and Co-occurrence]

Integrating genomic context scores into one single score

• Compare each individual method against an independent benchmark (KEGG), and find “equivalency”
• Multiply the chances that two proteins are not interacting and subtract from 1; naive Bayesian, i.e. assuming independence
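The second bullet corresponds to S = 1 - (1 - S_1)(1 - S_2)...(1 - S_n), where each S_i is the benchmark-calibrated probability from one method. A minimal sketch follows, assuming the per-method scores have already been calibrated against KEGG. The function name and the example values are hypothetical.

```python
from math import prod


def integrate_scores(scores):
    """Combine per-method probabilities, assuming the methods are independent.

    Each score is the probability that a pair is functionally linked
    according to one method; the result is 1 minus the product of the
    probabilities that the pair is NOT linked.
    """
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return 1.0 - prod(1.0 - s for s in scores)


# Made-up calibrated scores for one protein pair.
example = {"fusion": 0.3, "gene_order": 0.6, "cooccurrence": 0.5}
print(round(integrate_scores(example.values()), 3))  # 0.86
```

Because of the independence assumption, the combined score is never lower than the best single score, so correlated evidence types would be overcounted.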

Benchmark

[Figure: coverage (number of predicted links between orthologous groups, log scale from 10 to 100000) against accuracy (fraction of confirmed predictions, i.e. same KEGG map, 0.5 to 1.0), plotted for Fusion (norm.), Fusion (abs.), Gene Order (norm.), Gene Order (abs.), Cooccurrence and Integrated.]
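The two axes of this benchmark can be computed per score cutoff roughly as below. This is a sketch under stated assumptions: predictions are given as a pair-to-score mapping, KEGG annotation as a gene-to-maps mapping, and pairs with an unannotated member count towards coverage but are left out of the accuracy estimate. All names and data are hypothetical.

```python
def benchmark(predictions, kegg_maps, cutoff):
    """Return (accuracy, coverage) for all predicted links scoring >= cutoff.

    predictions: dict mapping (gene_a, gene_b) -> score
    kegg_maps:   dict mapping gene -> set of KEGG map identifiers
    accuracy:    fraction of judged links whose members share a KEGG map
    coverage:    number of predicted links at this cutoff
    """
    kept = [pair for pair, score in predictions.items() if score >= cutoff]
    # Assumption: only pairs with both members annotated can be judged.
    judged = [(a, b) for a, b in kept if a in kegg_maps and b in kegg_maps]
    confirmed = sum(1 for a, b in judged if kegg_maps[a] & kegg_maps[b])
    accuracy = confirmed / len(judged) if judged else float("nan")
    return accuracy, len(kept)


# Toy data (made up): gene "U" has no KEGG annotation.
predictions = {("A", "B"): 0.9, ("A", "C"): 0.7, ("B", "U"): 0.8, ("C", "D"): 0.4}
kegg_maps = {"A": {"m1"}, "B": {"m1", "m2"}, "C": {"m3"}, "D": {"m3"}}

print(benchmark(predictions, kegg_maps, cutoff=0.5))  # (0.5, 3)
```

Sweeping the cutoff from high to low traces one curve of the plot: coverage grows while accuracy usually drops.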

Performance of genomic context compared to high-throughput interaction data

[Figure: coverage (fraction of reference set covered by data) against accuracy (fraction of data confirmed by reference set). Labelled data sets: purified complexes (TAP), purified complexes (HMS-PCI), yeast two-hybrid, mRNA co-expression, synthetic lethality, genomic context, and combined evidence (two methods, three methods). Legend: raw data, filtered data, parameter choices.]


Data-mining proteins for protein function prediction: BolA

An interaction of BolA with a mono-thiol glutaredoxin? (STRING)

[Figure label: BolA]

BolA and Grx occur as neighbors in a number of genomes

[Figure labels: BolA, Grx]

BolA and Grx have an (almost) identical phylogenetic distribution
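An "(almost) identical phylogenetic distribution" can be quantified by comparing presence/absence profiles genome by genome. The sketch below uses the plain fraction of agreeing genomes, which is one simple choice among several. The profiles are made up for illustration and are not the real BolA and Grx distributions.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two genes are both present or both absent.

    Each profile is a sequence of 0/1 values, one entry per genome,
    with both profiles listing the genomes in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    if not profile_a:
        raise ValueError("profiles must not be empty")
    matches = sum(1 for x, y in zip(profile_a, profile_b) if bool(x) == bool(y))
    return matches / len(profile_a)


# Toy profiles over eight hypothetical genomes; they differ in one genome.
toy_bola = [1, 1, 0, 1, 0, 1, 1, 0]
toy_grx = [1, 1, 0, 1, 0, 1, 0, 0]

print(profile_agreement(toy_bola, toy_grx))  # 0.875
```

A plain agreement fraction treats all genomes as independent observations; closely related genomes inflate it, so real analyses usually correct for phylogeny.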

BolA and Grx have been shown to interact in Y2H in S. cerevisiae and D. melanogaster, and in Flag tag in S. cerevisiae

BolA phylogeny

BolA does have (predicted) interactions with cell-division / cell-wall proteins, but those appear secondary to the link with Grx

STRING has obtained a higher resolution in function prediction than phenotypic analyses

[Figure labels: Cell division / Cell wall; (oxidative) stress]

BolA is homologous to the peroxide reductase OsmC, suggesting a similar function

OsmC uses the thiol groups of two evolutionarily conserved cysteines to reduce substrates

Problem: The BolA family does not have conserved cysteines.

…It would have to obtain its reducing equivalents from elsewhere…

[Figure: BolA family alignment]

BolA is (homologous to) a reductase

BolA interacts with Grx?

Grx provides BolA with reducing equivalents!? (or “scaffolding”?)

Prediction of interaction partner and molecular function complement each other

Genomic context: biochemistry by other means

Despite their high performance, genomic context methods are not a push-button tool for function prediction

It is more like biochemistry by other means.

Often quite a lot of manual input and expert knowledge from the researcher is needed to distill associations into a concrete function prediction

Small-scale bioinformatics?

Contents

• Predicting functional interactions between proteins
• Genomic context methods
  – General
  – Fusion
  – Gene order
  – Co-occurrence across genomes
• Integration and benchmarking of predictions
• Interaction networks
• In addition to genomic context: functional genomics data

STRING currently in addition includes:

• Functional association data from large-scale / high-throughput biochemical experiments (functional genomics data)
  – protein complex purification
  – yeast-2-hybrid
  – ChIP-on-chip
  – micro-array gene expression
• “Known” functional relations, so-called “legacy data”, as present in PubMed abstracts and databases like MIPS or KEGG.

• Handing out article and questions: “Identification of a bacterial regulatory system for ribonucleotide reductases by phylogenetic profiling.” Read the article and hand in the answers to the questions by Monday November 28th.