PPI network construction and false positive detection Jin Chen CSE891-002 2012 Fall 1

1

PPI network construction and false positive detection

Jin Chen

CSE891-002

2012 Fall

2

Layout

• Protein-protein interaction (PPI) networks

• PPI network construction

• PPI network false-positive detection

3

Background

• Study of interactions between proteins is fundamental to the understanding of biological systems

• PPIs have been studied through a number of high-throughput experiments

• PPIs have also been predicted through an array of computational methods that leverage the vast amount of sequence data generated

• Comparative genomics at the sequence level has indicated that species differences are due more to differences in the interactions between the component proteins than to the individual genes themselves *

* Valencia A, Pazos F: Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002, 12:368-373.

4

Nidhi et al. DSiMB 2009

PPI at different levels

3D structure

Protein folding

Protein docking

Domain

5

Hawoong Jeong

PPI at different levels

Node – protein

• Every node represents a unique protein

Edge – protein interaction

• Physical interaction

• Functional interaction

6

PPI Identification

• The concept of PPI ranges from direct physical interactions inferred from experimental methods (e.g. yeast two-hybrid) to functional linkages predicted by computational analysis (based on protein sequences and structures)

• Given the difficulties in experimentally identifying PPIs, a wide range of computational methods has been used to identify functional PPIs

7

Domain Fusion

Hypothesis: if domains A and B exist fused in a single polypeptide AB in another organism, then A and B are functionally linked

Marcotte EM et al. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285(5428) 751-753 1999
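The domain fusion hypothesis can be sketched in a few lines. This is a minimal illustration, not the published method: it assumes each protein is already annotated with a set of domain identifiers, and the function name and the toy identifiers are hypothetical.

```python
from itertools import combinations


def rosetta_stone_links(query_proteins, reference_proteins):
    """Find query protein pairs whose domains occur fused in one reference protein.

    Both arguments map a protein name to a set of domain identifiers.
    Returns a set of (protein_a, protein_b, fused_reference_protein) tuples.
    """
    links = set()
    for name_a, name_b in combinations(sorted(query_proteins), 2):
        doms_a = query_proteins[name_a]
        doms_b = query_proteins[name_b]
        # Skip unannotated proteins and pairs that share a domain:
        # a shared domain makes the fusion evidence ambiguous.
        if not doms_a or not doms_b or doms_a & doms_b:
            continue
        for ref_name, ref_doms in reference_proteins.items():
            # The reference protein must contain the domains of both A and B.
            if doms_a <= ref_doms and doms_b <= ref_doms:
                links.add((name_a, name_b, ref_name))
    return links


# Toy data (hypothetical): A and B are separate in the query genome,
# but their domains are fused in reference protein "AB".
query = {"A": {"d1"}, "B": {"d2"}, "C": {"d3"}}
reference = {"AB": {"d1", "d2"}, "X": {"d3", "d4"}}
print(rosetta_stone_links(query, reference))
```

Only the pair (A, B) is linked here, because no reference protein contains the domains of C together with those of A or B.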

8

Domain Fusion

• Inclusion of eukaryotic sequences increased the robustness of domain fusion predictions *

• Eukaryotic cells, with their larger volume, cannot afford to keep A and B as separate proteins: the concentrations of A and B required to reach the same equilibrium concentration of AB would be prohibitively high

• Limitation: low coverage

*Veitia RA: Rosetta Stone proteins: "chance and necessity"? Genome Biol 2002,3(2):interactions1001.1-1001.3.

9

Conserved Neighborhood

Hypothesis: If the genes that encode two proteins are neighbors on the chromosome in several genomes, the corresponding proteins are likely to be functionally linked

Dandekar T et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, 23(9):324-328
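A minimal sketch of the conserved-neighborhood test, assuming orthologs have already been identified and each genome is given as an ordered list of gene IDs. The function name, the `max_gap` parameter, and the toy gene IDs are assumptions made for illustration.

```python
def conserved_neighbor_count(gene_orders, orthologs_a, orthologs_b, max_gap=1):
    """Count genomes in which the orthologs of proteins A and B are neighbors.

    gene_orders: genome name -> list of gene IDs in chromosomal order
    orthologs_a, orthologs_b: genome name -> gene ID of the ortholog there
    max_gap: largest allowed difference in position (1 = directly adjacent)
    """
    count = 0
    for genome, order in gene_orders.items():
        gene_a = orthologs_a.get(genome)
        gene_b = orthologs_b.get(genome)
        # Both orthologs must exist in this genome (the "dual requirement").
        if gene_a not in order or gene_b not in order:
            continue
        if abs(order.index(gene_a) - order.index(gene_b)) <= max_gap:
            count += 1
    return count


# Toy data (hypothetical gene IDs).
gene_orders = {
    "g1": ["x", "a1", "b1", "y"],  # adjacent
    "g2": ["a2", "z", "b2"],       # one gene in between
    "g3": ["b3", "a3"],            # adjacent, reversed order
    "g4": ["a4", "w"],             # no ortholog of B
}
orth_a = {"g1": "a1", "g2": "a2", "g3": "a3", "g4": "a4"}
orth_b = {"g1": "b1", "g2": "b2", "g3": "b3"}
print(conserved_neighbor_count(gene_orders, orth_a, orth_b))
```

The more genomes in which the pair stays adjacent, the stronger the evidence for a functional link. Genome g4 shows why coverage is low: without an ortholog of B, the genome contributes nothing.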

10

Conserved Neighborhood

• The method has been reported to identify high-quality functional relationships

• The method suffers from low coverage, due to the dual requirement of identifying orthologues in another genome and then finding those orthologues that are adjacent on the chromosome

Marcotte EM: Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10:359-365

11

Phylogenetic Profiles

• Hypothesis: functionally linked proteins would co-occur in genomes

• Phylogenetic profile of a protein can be represented as a 'bit string', encoding the presence or absence of the protein in each of the genomes considered
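The bit-string representation can be illustrated as follows. This is a sketch under stated assumptions: presence of a protein in a genome is taken as given, Hamming distance is used as one simple way to compare profiles, and the function names and genome labels are hypothetical.

```python
def phylogenetic_profile(genomes_with_protein, genomes):
    """Encode presence (1) or absence (0) of a protein in each genome, in order."""
    return "".join("1" if g in genomes_with_protein else "0" for g in genomes)


def hamming_distance(profile_a, profile_b):
    """Number of genomes in which exactly one of the two proteins is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Toy data (hypothetical genome labels).
genomes = ["g1", "g2", "g3", "g4"]
p1 = phylogenetic_profile({"g1", "g3"}, genomes)        # "1010"
p2 = phylogenetic_profile({"g1", "g3", "g4"}, genomes)  # "1011"
print(p1, p2, hamming_distance(p1, p2))
```

Proteins with identical or nearly identical profiles (small distance) are candidates for a functional link.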

12

Co-evolution

• Hypothesis: Co-evolution requires the existence of mutual selective pressure on two or more species

• The in silico two-hybrid (i2h) method has been proposed based on the study of correlated mutations in multiple sequence alignments

Pazos F et al: In silico Two-Hybrid System for the Selection of Physically Interacting Protein Pairs. Proteins 2002, 47:219-227

[Figure: protein family A and protein family B]
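The idea of correlated mutations can be sketched for a single pair of alignment columns, one from each protein family. This is a heavily simplified illustration and not the i2h implementation: it assumes both alignments list the same species in the same order, and it uses residue identity (1 if equal, 0 otherwise) as the similarity measure. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_mutation_score(column_a, column_b):
    """Correlate residue changes in two alignment columns across species pairs.

    column_a, column_b: residues at one position of family A and family B,
    one character per species, in the same species order.
    """
    if len(column_a) != len(column_b):
        raise ValueError("columns must cover the same species")
    species_pairs = list(combinations(range(len(column_a)), 2))
    # Identity similarity for every pair of species, per column.
    sim_a = [1.0 if column_a[k] == column_a[l] else 0.0 for k, l in species_pairs]
    sim_b = [1.0 if column_b[k] == column_b[l] else 0.0 for k, l in species_pairs]
    return pearson(sim_a, sim_b)


# Toy columns: whenever family A switches A->G, family B switches L->V.
print(correlated_mutation_score("AAGG", "LLVV"))
```

A score near 1 means the two positions change together across species, which is the signal the co-evolution hypothesis looks for. A fully conserved column carries no signal and scores 0.0 here.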

13

Software: Protein Link Explorer (PLEX)

Date, S.V. and E.M. Marcotte, Protein function prediction using the Protein Link EXplorer (PLEX). Bioinformatics, 2005. 21(10): p. 2558-2559.

14

Biological Problem → Algorithm → Knowledge

1. Biological hypothesis

2. Mathematical representation

3. Algorithm design

4. Biological verification

15

High-throughput PPI Detection

• Boom in biotechnology
  – Yeast two-hybrid / split-ubiquitin system
  – Mass spectrometry
  – Protein microarrays
  – etc.

• Limitations of computational prediction
  – Low coverage
  – Locally optimized (pair-wise)
  – Very high negative PPI rates

Yeast Two-Hybrid

• Two hybrid proteins are generated with transcription factor domains

• Both fusions are expressed in a yeast cell that carries a reporter gene whose expression is under the control of binding sites for the DNA-binding domain

[Figure: bait protein with binding domain, prey protein with activation domain, reporter gene]

Yeast Two-Hybrid

• Interaction of bait and prey proteins localizes the activation domain to the reporter gene, thus activating transcription

• Since the reporter gene typically codes for a survival factor, yeast colonies will grow only when an interaction occurs

[Figure: bait protein with binding domain, prey protein with activation domain, reporter gene]

Mating based Split-ubiquitin System

Lalonde S et al. Plant J 2008

Yeast Cell Growth Rate

[Figure: the trends for yeast cell growth over time; axis label: biomass]

20

PPI Databases

• STRING – PPIs derived from high-throughput experimental data, mining of databases and literature, analyses of co-expressed genes, and computational predictions

• HPRD – Human Protein Reference Database. It integrates information relevant to the function of human proteins in health and disease

• DIP – Experimentally derived PPIs with assessments. DIP is generally considered a valuable benchmark for verifying the performance of any new PPI prediction method

• Many others: MIPS, YGD, BIND, TAIR…

21

False-Positive Detection in PPI Networks

• Background: PPI networks generated with high-throughput methods contain a sizeable number of false-positives and their reproducibility is not satisfactory*

• Central to the understanding of PPI is the definition of “interaction” itself
  – Binding energy / Interaction / Complex
  – We need to define what we mean by interaction

* von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403

22

Useful Data for False-Positive Detection

• Functional and localization data (Gene Ontology)

• Indirect high-throughput data (gene and protein expression)

• Sequence-related data (protein domains (domain fusion), interologs)

• Structure data (protein 3D structure)

• Network topological features (connectivity, network motif)

23

Different Hypotheses for Different Data

Data: example of hypothesis

• Gene Ontology: two proteins that share a similar annotation are more likely to interact than proteins with different or null annotations

• Gene expression: two proteins whose genes have similar expression patterns are more likely to interact

• Domain interaction: if two domains are often found in PPIs, two proteins containing these domains are more likely to interact

• PPI network topological analysis: PPIs whose topology fits the spoke or matrix model are more likely to be true

Other hypotheses include synthetic lethality, interologs, linear motifs, etc.
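As one illustration of how such a hypothesis becomes a filter, the Gene Ontology hypothesis can be sketched with a set-overlap score. This is an assumption-laden sketch: the Jaccard index stands in for whatever annotation similarity a real method would use, the threshold is arbitrary, and the function names and term labels are hypothetical.

```python
def go_jaccard(terms_a, terms_b):
    """Jaccard overlap of two annotation sets; 0.0 if either protein is unannotated."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_low_support(interactions, annotations, threshold=0.2):
    """Return the interactions whose annotation overlap falls below the threshold."""
    flagged = []
    for p, q in interactions:
        score = go_jaccard(annotations.get(p, ()), annotations.get(q, ()))
        if score < threshold:
            flagged.append((p, q))
    return flagged


# Toy data (hypothetical proteins and term labels, not real GO IDs).
annotations = {
    "P1": {"t1", "t2"},
    "P2": {"t2", "t3"},
    "P3": set(),  # null annotation
}
interactions = [("P1", "P2"), ("P1", "P3")]
print(flag_low_support(interactions, annotations))
```

Flagged pairs are candidates for false positives, not proven ones: a null annotation lowers the score without being evidence against the interaction.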

24

Gold Standard for PPI Networks

• For algorithm evaluation and comparison

• To train a model as positive training data

• Manually annotated databases such as DIP

• Interactions from low-throughput experiments

• True negative set is equally important
  – Co-localized? No?

25

Estimate PPI Network Reliability

• Overall index of reliability of a PPI network

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$
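A minimal sketch of computing precision (TP / (TP + FP)) and recall (TP / (TP + FN)) for a PPI network against a gold standard. It makes a simplifying assumption that should be stated loudly: every predicted interaction absent from the gold-standard positive set is counted as a false positive, which overstates FP when the gold standard is incomplete. The function name is hypothetical.

```python
def precision_recall(predicted, gold_positive):
    """Precision and recall of predicted interactions against gold positives.

    Interactions are undirected, so (A, B) and (B, A) count as the same pair.
    Assumption: anything not in gold_positive is treated as a false positive.
    """
    pred = {frozenset(pair) for pair in predicted}
    gold = {frozenset(pair) for pair in gold_positive}
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical protein names).
predicted = [("A", "B"), ("B", "C"), ("C", "D")]
gold = [("B", "A"), ("C", "D"), ("E", "F"), ("G", "H")]
print(precision_recall(predicted, gold))
```

Here TP = 2, FP = 1, and FN = 2, giving a precision of 2/3 and a recall of 1/2.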

26

Estimate PPI Network Reliability

“capture-recapture” model - reaching back to the raw counts of observed bait–prey clones of yeast-two hybrid experiments

Huang et al. Where Have All the Interactions Gone? Estimating the Coverage of Two-Hybrid Protein Interaction Maps. PLoS Computational Biology 2007
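The capture-recapture idea can be illustrated with the classical two-sample Lincoln-Petersen estimator. This is the textbook estimator, swapped in for illustration, and not necessarily the statistical model used by Huang et al.; it also assumes the two screens sample interactions independently. The function name is hypothetical.

```python
def lincoln_petersen(screen_1, screen_2):
    """Estimate the total number of interactions from two overlapping screens.

    N is estimated as n1 * n2 / m, where n1 and n2 are the numbers of
    interactions seen in each screen and m is the number seen in both.
    """
    first = {frozenset(pair) for pair in screen_1}
    second = {frozenset(pair) for pair in screen_2}
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no overlap between screens; the estimate is undefined")
    return len(first) * len(second) / recaptured


# Toy data (hypothetical): 4 and 3 interactions observed, 2 in common.
screen_1 = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
screen_2 = [("B", "A"), ("C", "D"), ("E", "F")]
estimate = lincoln_petersen(screen_1, screen_2)
print(estimate, len(screen_1) / estimate)  # estimated total, coverage of screen 1
```

A small overlap between two screens implies a large unseen interactome and therefore low coverage for each screen.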

27

PPI Filtering

• GOAL: To identify reliable protein complexes from two existing mass spectrometry (MS) datasets

• Analyze the data with a purification enrichment (PE) scoring system

• Using gold standard PPIs, the consolidated dataset is of greater accuracy than the original sets and is comparable to PPIs defined using more conventional small-scale methods

Collins et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007 Mar;6(3):439-50

28

PPI Filtering

• e = 0: no evidence for or against the validity of a particular interaction was collected

• Two types of observations: bait-prey observations and prey-prey observations

• i and j are two proteins (bait and prey). k indicates a distinct purification. $M_{ijk}$ measures indirect evidence due to co-occurrence of proteins i and j as preys in the same purification k

$$e_{\text{observation}} = \log_{10} \frac{P(\text{observation} \mid \text{True PPI})}{P(\text{observation} \mid \text{False PPI})}$$

$$PE_{ij} = \sum_{k} \left( e_{ijk} + e_{jik} + M_{ijk} \right)$$

29

PPI Filtering

$$e_{ijk} = \log_{10} \frac{r + (1 - r)\, p_{ijk}}{p_{ijk}}$$

where $r$ represents the probability that a true association will be preserved and detected in a purification experiment, and $p_{ijk}$ represents the probability that a bait-prey pair will be observed for nonspecific reasons

$$p_{ijk} = 1 - \exp\left(-f_j \, n_{ik}^{\text{prey}} \, n_i^{\text{bait}}\right)$$

where $n_{ik}^{\text{prey}}$ is the number of preys identified in purification k with bait i, $n_i^{\text{bait}}$ is the number of times protein i was used as bait, and $f_j$ is an estimate of the nonspecific frequency of occurrence of prey j in the dataset
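Reading the two formulas as e_ijk = log10((r + (1 - r) p_ijk) / p_ijk) and p_ijk = 1 - exp(-f_j n_ik^prey n_i^bait), the bait-prey evidence term can be sketched as below. The function names are hypothetical and the numeric values of r and f_j are made up for illustration; they are not the values fitted by Collins et al.

```python
from math import exp, log10


def nonspecific_probability(f_j, n_prey_ik, n_bait_i):
    """p_ijk: probability of seeing prey j with bait i for nonspecific reasons."""
    return 1.0 - exp(-f_j * n_prey_ik * n_bait_i)


def bait_prey_evidence(r, p_ijk):
    """e_ijk: log10 likelihood ratio of an observed bait-prey pair."""
    if not 0.0 < p_ijk <= 1.0:
        raise ValueError("p_ijk must lie in (0, 1]")
    return log10((r + (1.0 - r) * p_ijk) / p_ijk)


# Illustrative values only (assumptions, not fitted parameters).
r = 0.5      # chance that a true association is preserved and detected
f_j = 0.01   # nonspecific frequency of prey j
p = nonspecific_probability(f_j, n_prey_ik=10, n_bait_i=2)
print(p, bait_prey_evidence(r, p))
```

The score behaves as the slide describes: a prey that is rarely seen nonspecifically (small p_ijk) yields strong positive evidence, while p_ijk = 1 gives e_ijk = 0, meaning the observation says nothing either way.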


32

PPI Filtering

• PPI topological analysis

– First student presentation is about a topological measure called “FS-weight”, which was compared with other topological measures

– Suitable for large PPI networks rather than preliminary networks

33

"Most good programmers do programming not because they expect to get paid or get adulation by the public, but because it is fun to program." - Linus Torvalds
