BIOINFORMATIC ANALYSES OF PROTEIN-PROTEIN INTERACTION NETWORKS

UE Systems Biology and Complex Systems, Master Biosciences, Lyon, 05/20/2010
& Biology Summer School, Marseille, 09/01/2010

Christine Brun, Marseille

A protein never acts alone…

…but interacts with others to perform its function.

Molecular interactions:

- protein-DNA
- protein-RNA
- protein-protein
- protein-lipid
- protein-small molecule

MolecularMovies.org The Inner Life of the Cell

THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (1/3)

1- Structural diversity

• The interaction occurs between identical molecules: homo-dimers, trimers, tetramers...

Ex: ferritin, 24 identical polypeptides

• The interaction occurs between different polypeptides: hetero-dimers, trimers, tetramers...

Ex: RNA polymerase, 12 different polypeptides

2- Functional Diversity

Interactions within the JAK-STAT signaling pathway in Drosophila

• Non-obligatory PPI:

Proteins are stable independently. Proteins are functional independently.

The interaction performs an action.

Ex: antigen-antibody reaction, enzymatic reaction, phosphorylation reaction (signaling…)

THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (2/3)

• Obligatory PPI :

Proteins are not stable independently. Proteins are not functional independently.

The interaction is necessary for stability and function.

Ex: protein complexes (DNA polymerase, RNA polymerase, ribosome…)

2- Functional Diversity

Interactions within a complex: the yeast proteasome

THE DIVERSITY OF P-P INTERACTIONS, from Nooren & Thornton (3/3)

3- Dynamic Diversity

• Transient PPI:

Associate and dissociate in vivo.

Transient PPI may be non-obligatory.

Interactions within the JAK-STAT signaling pathway in Drosophila

• Permanent PPI :

Exist only in complexes.

Permanent PPI generally correspond to obligatory PPI.

Interactions within a complex: the yeast proteasome


A protein never acts alone…

…but interacts with others to perform its function.

Protein Function: a complex notion

Different integration levels:

- Molecule: molecular function
- Cell: cellular function
- Tissue, organ: physiology
- Organism: development, reproduction
- Population: ecological equilibrium

Molecular function of the proteins

Molecular activity. Examples: kinase, ATPase, DNA binding...

Approaches: biochemical analyses; bioinformatic predictions (similarity search between sequences and structures)

But...

~30% of the genes/proteins of each newly sequenced organism do not show any similarity with any known gene.

Cellular function of the proteins

Biological process. Examples: signaling, transcription, establishment of the epithelium

Approaches: genetic analyses; cellular biology

But...

Sharan et al., Mol Syst Biol, 2007

Protein function: functional annotations are also missing in E. coli

Bouveret & Brun, MethMolBio, 2010

Interaction analysis allows investigating protein cellular functions.

Interactions within a process: the JAK-STAT signaling pathway in Drosophila

Interactions within a complex: the yeast proteasome

How to study the protein cellular functions ?

Cellular function = Biological process

HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?

Two high-throughput methods:

- Two-hybrid screens

- Affinity Purification followed by Mass Spectrometry

THE MODULARITY OF THE TRANSCRIPTION FACTORS, as an elementary principle of the yeast two-hybrid

Transcription factor: DNA binding site (BS) + transcription activation domain (AD), able to activate the basal transcription machinery

[Figure: a transcription factor (BS + AD) bound to its binding site upstream of a gene; the gene is transcribed into messenger RNA]

YEAST TWO-HYBRID: Principle of the test

[Figure: bait X fused to BS and prey Y fused to AD, upstream of a reporter gene; when X and Y interact, reporter RNA is produced]

• The bait protein is fused to the DNA-binding part (BS) of a transcription factor.
• The prey proteins (potential interactors) are fused to the activation domain (AD) of a transcription factor.
• The fusion proteins are expressed in a yeast strain containing a reporter gene under the control of the site bound by BS.
• When the prey Y interacts with the bait X, the activation domain AD is brought close to the gene promoter and transcription of the reporter gene can occur.

THE LARGE-SCALE TWO-HYBRID SCREENS

S. cerevisiae: Uetz et al., 2000; Ito et al., 2001
P. falciparum: LaCount et al., 2005
C. elegans: Li et al., 2004
D. melanogaster: Giot et al., 2003; Stanyon et al., 2004; Formstecher et al., 2005
H. sapiens: Stelzl et al., 2005; Rual et al., 2005
T. pallidum: Titz et al., 2008
H. pylori: Rain et al., 2001
C. jejuni: Parrish et al., 2007
Synechocystis: Sato et al., 2007
Mesorhizobium: Shimoda et al., 2008

+ viruses (bacteriophage T7, vaccinia, HCV, BPV, Herpes, EBV…)

+ host-virus (HCV-human, EBV-human)

Protein reconstitution

Bouveret & Brun, MethMolBio, 2010

OTHER LARGE SCALE METHODS BASED ON THE TWO-HYBRID PRINCIPLE (1/2)

MAPPIT, a functional complementation assay (cytokine receptor signaling pathway)

Bouveret & Brun, MethMolBio, 2010

OTHER LARGE SCALE METHODS BASED ON THE TWO-HYBRID PRINCIPLE (2/2)

HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?

Two high-throughput methods:

- Two-hybrid screens

- Affinity Purification followed by Mass Spectrometry

PRINCIPLE OF COMPLEX PURIFICATION

[Figure: a bait protein Y fused to a tag is captured by an anti-tag antibody]

PROTOCOL FOR COMPLEX PURIFICATION

Different types of tags, made of two parts separated by a cleavage site, allow two purification steps.

Bouveret & Brun, MethMolBio, 2010


AP/MS analyses

S. cerevisiae: Gavin et al., 2002, 2006; Ho et al., 2002; Krogan et al., 2006
E. coli: Butland et al., 2005; Arifuzzaman et al., 2006; Hu et al., 2009
M. pneumoniae: Kühner et al., 2009
D. melanogaster: Perrimon lab, 2009

(+ signaling pathways in Drosophila and human)

HOW TO IDENTIFY PROTEIN-PROTEIN INTERACTIONS AT THE WHOLE PROTEOME SCALE?

The two high-throughput methods do not detect the same interaction types:

- the yeast two-hybrid method detects interactions that are biophysically possible, whether transient or permanent.

- the tandem affinity purification method identifies permanent interactions in vivo.

FINDING INTERACTION DATA?

International Molecular Exchange Consortium

Specialized databases

Multi-organism:
- DIP (dip.doe-mbi.ucla.edu)
- IntAct (www.ebi.ac.uk/intact)
- MINT (mint.bio.uniroma2.it/mint)
- BioGRID (www.thebiogrid.org)
- BIND (www.blueprint.org)

Yeast:
- MPact (mips.gsf.de/genre/proj/mpact)

H:
- HPRD (www.hprd.org)
- Reactome (http://reactome.org/)

Meta-database:
- APID (bioinfow.dep.usal.es/apid/index.htm)

EXAMPLE 1: INTACT


EXAMPLE 2: APID METABASE

A standardized representation of interactions is proposed for databases. Authors are invited to submit their interactions to databases in this format when they submit their publication, for a better record of the knowledge.

Interactions described in databases can be represented as networks

Networks = a universal language to describe complex systems

Examples: disease spread [Krebs], social network, food web, neural network [Cajal], electronic circuit, Internet [Burch & Cheswick]

PROTEIN-PROTEIN INTERACTION NETWORK

Undirected graph

Node = protein

Edge = physical interaction
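A minimal sketch of this representation, assuming the interactions come as a plain list of protein pairs (the protein names and the `build_network` helper are illustrative, not taken from any database format):

```python
from collections import defaultdict

def build_network(interactions):
    """Return {protein: set of partners} for an undirected interaction list."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)  # undirected: the edge is stored in both directions
    return dict(network)

# Hypothetical binary interactions (protein names are placeholders)
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
ppi = build_network(pairs)
print(sorted(ppi["P3"]))  # partners of P3
```

Storing the partners as sets makes the later set-based measures (shared interactors, density) one-liners.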

WHAT IS AN INTERACTOME?

The set of all possible protein-protein interactions between all the proteins of an organism.

Jeong et al., 2001 Li et al., 2004

Formstecher et al., 2005 Rual et al., 2005

FAQ about INTERACTOMES

The set of all possible protein-protein interactions between all the proteins of an organism.

Are all detected interactions physiological? They are physically possible.

What is the size of the interaction space? ~75 000 to 350 000 interactions in human.

BUT... interactomes do not contain spatio-temporal information: they are 2D maps, projections, long-exposure photographs...

SHOULD WE TRUST LS-Y2H ? (1/4)

Comparison of the results of the 3 Drosophila LS-Y2H screens: Giot et al., Science 2003; Stanyon et al., Genome Biol 2004; Formstecher et al., Genome Res 2005

• Low overlap between experiments
• Since the size of the complete set of interactions is unknown: false positives or barely overlapping subsets?

SHOULD WE TRUST LS-Y2H ? (2/4)

Comparison of the results of 2 Drosophila LS-Y2H screens (Formstecher et al., Genome Res 2005; Giot et al., Science 2003)… only 30 baits in common

[Figure: Venn diagram of the two screens; labels: 20416 203723, 192 63823]

Yes, we should, since interactions detected in LS-Y2H are possible interactions.

Braun et al., Nature Meth 2009

Different detection methods do not detect p-p interactions with the same efficiency

SHOULD WE TRUST LS-Y2H ? (3/4)

Positive interactions (17-21%)

From Venkatesan et al., Nat. Meth. 2009

SHOULD WE TRUST LS-Y2H ? (4/4)

Negative interactions (false-positive rate 0.5-2%)

1- SENSITIVITY:

2- COVERAGE:

Low sensitivity rather than low specificity

AND TAP-TAG? (1/2)

Overlap of the interactions detected in the 3 TAP-TAG screens performed in E. coli.

Hu et al., Plos Biol 2009

AND TAP-TAG? (2/2)

Overlap of the interactions detected in the 1 TAP-TAG screen performed in M. pneumoniae and interactions identified/inferred by other means.

Kühner et al., Science 2009

WHAT ABOUT LITERATURE?

Cusick et al., Nature Meth 2009

MOTIVATIONS OF INTERACTOMICS

Discovery science:
• Knowing all the parts
• ~30% of the genes/proteins of sequenced genomes are of unknown function
→ Predict/discover protein function

Systems biology approach:
• Analyze the interactome properties
→ Bring some new insights into old and novel biological questions

POSSIBLE APPROACHES

1- Description: network organisation (statistics, graph theory…)

- Protein degree
- Edge-betweenness
- K-core
- Diameter
- ...

PROTEIN DEGREE

• The protein degree k: number of neighbours

• If the network is directed: k_in and k_out

[Figure: a node with k = 4; in the directed version of the same node, k_in = 1 and k_out = 3]
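The degree definitions can be sketched as follows, assuming an undirected network stored as a dictionary of neighbour sets and a directed network stored as a list of (source, target) edges. Both data layouts and the function names are illustrative choices.

```python
def degree(network, protein):
    """Degree k: number of neighbours in an undirected network."""
    return len(network[protein])

def in_out_degree(directed_edges, protein):
    """Return (k_in, k_out) from a list of (source, target) edges."""
    k_in = sum(1 for _, target in directed_edges if target == protein)
    k_out = sum(1 for source, _ in directed_edges if source == protein)
    return k_in, k_out

# Toy node X with four neighbours (names are placeholders)
undirected = {"X": {"A", "B", "C", "D"},
              "A": {"X"}, "B": {"X"}, "C": {"X"}, "D": {"X"}}
directed = [("A", "X"), ("X", "B"), ("X", "C"), ("X", "D")]

print(degree(undirected, "X"))       # k
print(in_out_degree(directed, "X"))  # (k_in, k_out)
```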

PROTEIN DEGREE

• Protein degree distribution:

[Figure: degree distribution in the yeast S. cerevisiae; x-axis: connectivity k (1 to 50); y-axis: number of genes (logarithmic scale, 0.2 to 2000)]

A lot of proteins are poorly connected

Some proteins are highly connected = « hubs »

What does it mean in biology?
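A degree distribution is just a count of proteins per degree value. The sketch below assumes the network is a dictionary of neighbour sets; the toy network with a single hub is invented for illustration.

```python
from collections import Counter

def degree_distribution(network):
    """Return {k: number of proteins having exactly k neighbours}."""
    return dict(Counter(len(partners) for partners in network.values()))

# Toy network: one hub H and five poorly connected proteins (placeholder names)
toy = {"H": {"A", "B", "C", "D", "E"},
       "A": {"H"}, "B": {"H"}, "C": {"H"}, "D": {"H"}, "E": {"H"}}

for k, count in sorted(degree_distribution(toy).items()):
    print(k, count)
```

Plotting these counts against k on logarithmic axes is the usual way to look for a heavy-tailed, power-law-like shape.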

…when the airline traffic network is organized as a protein-protein interaction network ???

Power-law distribution

WHAT DOES IT MEAN BIOLOGICALLY?

‘EDGE BETWEENNESS’

Number of shortest paths going through an edge (computed between all node pairs)

Bio: centrality. Processes are connected by interactions of high betweenness.

[Figure: example graph with nodes a to j; each edge is labelled with its betweenness, with values ranging from 1 to 24]

(Can also be used to disconnect the graph)
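A sketch of the computation for an unweighted, undirected network stored as a dictionary of neighbour sets. It uses a breadth-first accumulation in the style of Brandes' algorithm, which is one common way to count shortest paths and is chosen here as an assumption. When several shortest paths join a pair, each gets an equal fraction of that pair's count, which is why fractional values can appear.

```python
from collections import deque, defaultdict

def edge_betweenness(network):
    """Return {frozenset({u, v}): betweenness} for every edge on a shortest path."""
    score = defaultdict(float)
    for source in network:
        # Breadth-first search: distances and number of shortest paths
        dist = {source: 0}
        n_paths = defaultdict(int)
        n_paths[source] = 1
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in network[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path credit
        credit = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = n_paths[v] / n_paths[w] * (1 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Each unordered node pair was counted once from each of its two ends
    return {edge: s / 2 for edge, s in score.items()}

# Toy network: two triangles joined by the bridge c-d (placeholder names)
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
scores = edge_betweenness(toy)
bridge = max(scores, key=scores.get)
print(sorted(bridge), scores[bridge])
```

In this toy case the edge with the highest score is the bridge between the two triangles; removing it disconnects the graph.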

K-CORE NOTION or how to peel the interactome… (1/3)

* Recursively remove vertices/proteins according to their number of neighbours

[Figure: example network; the proteins removed at the first step are crossed out]

a, b, c, d, e belong to core 1

K-CORE NOTION or how to peel the interactome… (2/3)

* Recursively remove vertices/proteins according to their number of neighbours

[Figure: same network after the second peeling step]

a, b, c, d, e belong to core 1; f, g, h, i, j, k, l belong to core 2…

K-CORE NOTION or how to peel the interactome… (3/3)

Bio: Central proteins vs. peripheral proteins

Functional differences?
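The peeling procedure can be sketched as below, assuming a symmetric dictionary of neighbour sets. The function returns, for each protein, the index of the deepest core it belongs to; the function name and toy network are illustrative.

```python
def core_numbers(network):
    """Peel the network; return {protein: index of the deepest core it is in}."""
    # Work on a copy and ignore self-interactions
    remaining = {p: set(nbrs) - {p} for p, nbrs in network.items()}
    core = {}
    k = 0
    while remaining:
        # Proteins with at most k remaining neighbours are peeled off at level k
        to_remove = [p for p, nbrs in remaining.items() if len(nbrs) <= k]
        if not to_remove:
            k += 1
            continue
        for p in to_remove:
            core[p] = k
            for q in remaining.pop(p):
                remaining[q].discard(p)
    return core

# Toy network: a triangle x-y-z, a pendant protein w, an isolated protein s
toy = {"x": {"w", "y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
       "w": {"x"}, "s": set()}
print(core_numbers(toy))
```

Peripheral proteins get low core indices and are removed first; the proteins that survive longest form the central cores.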

Potential difficulties:

It is often difficult to match a graph property/characteristic to a biological role/property…

…and this difficulty increases when the graph/interactome does not contain any spatio-temporal information

Towards data integration

• The p-p interaction network is a static view
• Not all interactions happen at the same time in the same place!
• ‘Dynamic’ information: expression data from transcriptome experiments.

[Figure: gene expression over time (on, off, on)]

EXAMPLE OF DATA INTEGRATION (1/2)

• Different kinds of hubs [Han et al., Nature 2004]:

1st possibility: simultaneous interactions, « party hubs » (intra-process role)

2nd possibility: successive interactions, « date hubs » (inter-process communication)

[Figure: hub interactions at different times; labels: M phase of the cell cycle, S phase of the cell cycle]

POSSIBLE APPROACHES

2- Functional module identification for function prediction and systems biology: classification and graph partitioning

FUNCTION PREDICTION: 2 types of methods (from Sharan et al., Mol Syst Biol, 2007)

- Direct methods: function prediction
- Module detection: function prediction + systems biology

FUNCTION PREDICTION: direct method

Inference of the function of an uncharacterized protein by transfer of its neighbours’ functions.

- majority rule
- functional flux
- ...
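A minimal sketch of the majority rule, assuming one functional annotation per protein for simplicity (real annotations are usually multiple). The data layout and the `majority_rule` name are illustrative.

```python
from collections import Counter

def majority_rule(network, annotations, protein):
    """Predict the function most frequent among the annotated neighbours."""
    counts = Counter(
        annotations[n] for n in network[protein] if n in annotations
    )
    if not counts:
        return None  # no annotated neighbour: no prediction
    return counts.most_common(1)[0][0]

# Toy example: uncharacterized protein U and its neighbours (placeholder names)
toy_network = {"U": {"P1", "P2", "P3", "P4"}}
annotations = {"P1": "cell cycle", "P2": "cell cycle", "P3": "signaling"}
print(majority_rule(toy_network, annotations, "U"))
```

Ties between equally frequent functions are broken arbitrarily in this sketch; a real implementation would need an explicit rule.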

FUNCTION PREDICTION: module detection

Identification of groups of proteins. Inference of the group function.

- density
- distances
- edge-betweenness/betweenness cut
- optimisation of a criterion: modularity (higher number of internal edges than in a random partition of the graph with the same class cardinalities)
- ...

EXAMPLE 1: IDENTIFICATION OF MODULES BASED ON EDGE DENSITY

• What is a dense zone?

[Figure: two groups of 7 proteins: not dense... vs. ...rather dense!]

• « Rigorous » definition: the maximal number of connections between N proteins is ½N(N-1). Density is defined as:

$$ d = \frac{\text{number of connections}}{\text{maximal number of connections}} $$

Examples with N = 7: d = 6/21 ≈ 0.29 (not dense); d = 14/21 ≈ 0.67 (rather dense)
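The density of a group of proteins can be computed directly from this definition. The sketch assumes a symmetric dictionary of neighbour sets; the chain of 7 proteins is a made-up example with 6 connections out of 21 possible.

```python
def density(network, proteins):
    """Edge density of a group: internal connections / maximal number of connections."""
    group = set(proteins)
    n = len(group)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both of its ends, hence the division by 2
    internal = sum(len(network[p] & (group - {p})) for p in group) // 2
    return internal / (n * (n - 1) / 2)

# Toy chain of 7 proteins: 6 connections out of 21 possible
chain = {i: {j for j in (i - 1, i + 1) if 1 <= j <= 7} for i in range(1, 8)}
print(round(density(chain, range(1, 8)), 2))
```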

FUNCTIONAL MODULES IDENTIFIED BASED ON EDGE DENSITY

Cell cycle regulation

Signaling pathway triggered by pheromones

Spirin & Mirny, PNAS 2003

EXAMPLE 2: THE PRODISTIN METHOD, A FUNCTIONAL CLASSIFICATION BASED ON INTERACTIONS

1- Protein-protein interaction network
2- The Czekanowski-Dice distance (Dice, 1945)
3- A classification tree
4- Annotated classification tree (+ GO annotations)

Brun et al., Genome Biology, 2003; Baudot et al., Bioinformatics, 2006

HYPOTHESIS: the more interactors proteins share, the more likely they are to be functionally related.

PRINCIPLE: calculate a distance based on the number of interactors shared and not shared by protein pairs, as a reflection of their functional similarity.

[Figure: example network with proteins A, B, C, D]

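A POSSIBLE TRANSLATION OF THE BIOLOGICAL HYPOTHESIS: THE CZEKANOWSKI-DICE DISTANCE

$$ D(X, Y) = \frac{|X \setminus (X \cap Y)| + |Y \setminus (X \cap Y)|}{|X \cup Y| + |X \cap Y|} $$

where X and Y denote the sets of interactors of the two proteins.

Example: $D(X, Y) = \frac{2 + 3}{8 + 3} = \frac{5}{11}$

(Dice, Ecology, 1945)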
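A sketch of the distance on two interactor sets. Whether each protein is counted among its own interactors is left to the caller, which is an assumption of this example. The toy sets reproduce the 2 + 3 specific and 3 shared interactors of the numerical example above; the names are placeholders.

```python
def czekanowski_dice(x, y):
    """Czekanowski-Dice distance between two sets of interactors."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0
    # Numerator: interactors specific to X plus interactors specific to Y
    return len(x ^ y) / (len(union) + len(x & y))

# Toy sets: 2 interactors specific to X, 3 specific to Y, 3 shared (union of 8)
x = {"a", "b", "s1", "s2", "s3"}
y = {"c", "d", "e", "s1", "s2", "s3"}
print(czekanowski_dice(x, y))
```

The distance is 0 for identical interactor sets and 1 for disjoint ones, so small values indicate proteins likely to be functionally related under the hypothesis above.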


FUNCTIONAL MODULES IDENTIFIED BY THE COMPUTATION OF A DISTANCE

[CLASS: 76]
CA sensory_organ_development, neuroblast_division, cytoskeleton_organization_and_biogenesis
P# 5
PN pros, mira, insc, numb, baz

[CLASS: 60]
CA myoblast_fusion
P# 7
PN sls, Mhc, Act88F, Actn, rols, mbc, Crk

Functional classes = groups of proteins involved in the same pathway, the same protein complex or the same cellular process through interactions

PROTEIN FUNCTION PREDICTION: EXAMPLE OF THE 'DNA METABOLISM' AND 'CELL CYCLE' CLASS

[Figure: annotated class; labels: Pre-Replication Complex, protein kinase complex, telomere replication, targets PRC to origin, cell cycle control, chromatin structure, ?? telomere tethering to the nuclear periphery]

POSSIBILITY OF FUNCTION INFERENCE

PREDICTION OF CELLULAR FUNCTION

93 uncharacterised proteins

42 belong to PRODISTIN classes: a cellular function is predicted

- 2 new predictions (5%)
- 40 predicted by other bioinformatic methods or recently characterised experimentally:
  - 27 in agreement (64%)
  - 13 different (30%)

Brun et al., Genome Biol, 2003

STATISTICAL EVALUATION OF CLASS QUALITY AND FUNCTIONAL PREDICTIONS (1/2)

- Is the protein clustering significant? Would it happen by chance? Test on random networks of the same topology (‘reshuffling’): 15 classes instead of 64. Significant.

- What is the functional prediction quality? Suppose that all members of a class perform the function assigned to the class; compare these predictions with known functions. Success rate = # correctly predicted functions / # predictions = 67% (vs. 43% for the Majority Rule algorithm).

- What is the class quality? Class Robustness Index (CRI), based on tree topological criteria (bootstrap for a distance-based method): 0 < 0.96 < 1. Brun et al., Genome Biol, 2003

- How does PRODISTIN deal with noise in interaction data? Is it robust to the presence of both spurious and missing interactions in the dataset? Test on networks of different topologies (‘rewiring’). What is the prediction success rate?
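One simple way to simulate such noise is sketched below: a chosen fraction of the edges is removed (missing interactions) and replaced by the same number of random new edges (spurious interactions). This is an illustrative scheme under stated assumptions, not necessarily the rewiring procedure used in the cited study. The `rewire` function and its parameters are hypothetical.

```python
import random

def rewire(edges, fraction, seed=0):
    """Replace a fraction of the edges by random new ones (same edge count)."""
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    nodes = sorted({p for e in original for p in e})
    n_swap = round(fraction * len(original))
    free_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(original)
    if n_swap > free_pairs:
        raise ValueError("not enough unconnected pairs to rewire")
    # Missing interactions: drop n_swap true edges
    removed = rng.sample(sorted(original, key=sorted), n_swap)
    rewired = original - set(removed)
    # Spurious interactions: add n_swap pairs that were not connected
    added = 0
    while added < n_swap:
        candidate = frozenset(rng.sample(nodes, 2))
        if candidate not in original and candidate not in rewired:
            rewired.add(candidate)
            added += 1
    return rewired

# Toy ring of 10 proteins, 30% of the edges rewired
ring = [(i, (i + 1) % 10) for i in range(10)]
noisy = rewire(ring, 0.3)
print(len(noisy), len(noisy & {frozenset(e) for e in ring}))
```

Running the classification on networks rewired at increasing fractions, and recomputing the success rate each time, gives a robustness curve.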

STATISTICAL EVALUATION OF CLASS QUALITY AND FUNCTIONAL PREDICTIONS (2/2)

[Figure: number of proteins predicted (left axis, 0 to 600) and prediction rate (right axis, 0 to 1) as a function of the % of rewiring (0 to 50)]

Brun et al., Genome Biol, 2003

INTERACTOMES AND THE EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES

WHAT CAN WE LEARN ABOUT THE EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES WHEN STUDYING THE INTERACTOME WITH THE CLASSIFICATION METHOD ?

THE ANCESTRAL WHOLE GENOME DUPLICATION IN YEAST

UPDATE RESULTS 2004 2007

• 100-150 million years ago

• After the Kluyveromyces waltii and S. cerevisiae divergence (Kellis et al., 2004)

• Followed by massive deletion events

• 16% of the present genome is formed by WGD paralog pairs, remnants of this duplication event: 457-460 pairs (Wolfe & Shields, 1997; Seoighe & Wolfe, 1999; Kellis et al., 2004)

THE ANCESTRAL WHOLE GENOME DUPLICATION IN YEAST

Kellis et al., 2004

EVOLUTION OF THE NUMBER AND THE IDENTITY OF THE PARALOGUES’ INTERACTORS AFTER A GENOME DUPLICATION

[Figure: a protein A is duplicated at t0 and the two copies then diverge:
- t0 (duplication): the two copies have 4 shared interactors
- t1: 2 shared interactors, 2 specific interactors
- t2: 2 shared interactors, 4 specific interactors]
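Counting shared and specific interactors for a paralog pair reduces to set operations. The sketch assumes a dictionary of neighbour sets and excludes the two paralogs themselves from the comparison, which is a choice made for this example. Names are placeholders.

```python
def compare_interactors(network, a, b):
    """Return (shared, specific to a, specific to b) interactor sets."""
    partners_a = network[a] - {a, b}
    partners_b = network[b] - {a, b}
    shared = partners_a & partners_b
    return shared, partners_a - shared, partners_b - shared

# Toy paralog pair with 2 shared and 2 specific interactors
toy = {"A1": {"p", "q", "r"}, "A2": {"p", "q", "s"}}
shared, only_a1, only_a2 = compare_interactors(toy, "A1", "A2")
print(len(shared), len(only_a1) + len(only_a2))
```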

YEAST PPI NETWORK, 2004 AND 2007

                                            2004                        2007
Interactions                                3991 (*)                    17656
Proteins                                    2644 (~46% ORFs)            4773 (82.3% ORFs)
Mean degree                                 3                           7.4
Paralog pairs in the classification tree    38 (8% paralogs)            172 (37% paralogs)

(*) Core data LS-Y2H + home-made literature curation

2004: 38 paralog pairs. How are they classified?

Functional classification tree for yeast proteins (2004):

- Both paralogs are in the same class
- Paralogs are classified in different classes devoted to the same cellular function
- Paralogs are classified in different classes devoted to different cellular functions

THREE CLASSIFICATION BEHAVIOURS

                                       2004         2007
Same class, same function              26 (68%)     95 (55.3%)
Different class, same function         3 (8%)       13 (7.6%)
Different class, different function    9 (24%)      21 (12.2%)
Not classified                         -            43 (25%)

• The majority of the WGD paralogs (68%) are in the same functional class: they share interactors.
• The majority of the WGD paralogs are involved in the same cellular function (76%).

EVOLUTION OF CELLULAR FUNCTION: A SCALE OF FUNCTIONAL DIVERGENCE FOR DUPLICATED GENES BASED ON INTERACTION ANALYSIS

Functional divergence, from - to +:

- Same class, same function: 68%
- Different class, same function: 8%
- Different class, different function: 24%

EVOLUTION BY NEO-FUNCTIONALIZATION (Ohno, 1970)

EVOLUTION BY SUB-FUNCTIONALIZATION (Force et al., 1999)

Baudot et al., 2004, Genome Biology, 5: R76

CLASSIFICATION BEHAVIOURS AND SEQUENCE IDENTITY

[Figure: sequence identity of the paralog pairs by category: same class, same function; different class, same function; different class, different function; not classified]

• Sequence conservation does not necessarily imply function conservation: complex relationship between sequence identity and cellular function.
• The functional classification shows paralog properties not detectable by sequence analysis alone.

INTERACTION NETWORK AND EVOLUTION OF THE FUNCTION OF THE DUPLICATED GENES: CONCLUSIONS

* A scale of functional divergence for the duplicated genes based on the interactions: distinct scenarios for the evolution of the cellular function of the duplicated genes.

* Differences between paralog pairs detectable neither by sequence analysis nor by functional annotation analysis: a novel type of information provided by the analysis of the interactions.

(Baudot et al., Genome Biology, 2004)

Update 2004-2007:

* Stability of the results with a 4 times larger interactome:

- a high-quality interactome, even a small one, gives reliable biological information

- robustness of the network analysis method to false negatives/positives (Brun et al., Genome Biology, 2003).

- changes in functional annotations (‘knowledge’ effect) may change the biological interpretation of results.
