Prediction of protein networks through data integration

Lars Juhl Jensen

EMBL Heidelberg

prediction of interactions

STRING

functional interactions

373 genomes

model organism databases

Ensembl

Genome Reviews

RefSeq

genomic context methods

gene neighborhood

gene fusion

phylogenetic profiles

Cell

Cellulosomes

Cellulose

correct interactions

wrong associations

phylogenetic profiles

SVD (Singular Value Decomposition)

Euclidean distance
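
As a rough illustration of the profile comparison named above, the sketch below ranks gene pairs by the Euclidean distance between their presence/absence profiles. The gene names and profiles are invented, and the SVD step is left out to keep the example dependency-free; in practice the distance would presumably be computed on the reduced profiles.

```python
import math

# Hypothetical presence/absence profiles: one 0/1 entry per genome.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 0],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [0, 1, 1, 0, 1, 1],
}

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rank_pairs(profiles):
    """Return (distance, gene1, gene2) tuples, most similar pair first."""
    names = sorted(profiles)
    pairs = []
    for i, g1 in enumerate(names):
        for g2 in names[i + 1:]:
            pairs.append((euclidean_distance(profiles[g1], profiles[g2]), g1, g2))
    return sorted(pairs)

ranked = rank_pairs(profiles)
```

A small distance means the two genes tend to be present and absent in the same genomes, which is the signal this method relies on.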

gene neighborhood

sum of intergenic distances
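
One possible reading of "sum of intergenic distances" is sketched below: for two genes on the same strand, add up the gaps between all consecutive genes that lie between them. The coordinates and the handling of overlapping genes are assumptions for illustration, not details taken from the talk.

```python
# Hypothetical gene coordinates on one strand: (name, start, end), sorted by start.
genes = [
    ("g1", 100, 900),
    ("g2", 950, 1800),
    ("g3", 1830, 2500),
    ("g4", 4000, 5200),
]

def intergenic_sum(genes, name_a, name_b):
    """Sum the gaps between consecutive genes from name_a to name_b."""
    order = [g[0] for g in genes]
    i, j = sorted((order.index(name_a), order.index(name_b)))
    total = 0
    for left, right in zip(genes[i:j], genes[i + 1:j + 1]):
        # Assumption: overlapping genes contribute zero rather than a negative gap.
        total += max(0, right[1] - left[2])
    return total

close_pair = intergenic_sum(genes, "g1", "g3")
distant_pair = intergenic_sum(genes, "g1", "g4")
```

Under this reading, a smaller sum means a tighter neighborhood, so it serves as a raw quality score in which lower is better.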

raw quality scores

rank by reliability

not comparable

Euclidean distance

sum of intergenic distances

benchmarking

calibrate vs. gold standard

raw quality scores

probabilistic scores
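
The calibration step can be sketched as follows: bin the raw scores and, within each bin, measure the fraction of benchmarkable pairs that the gold standard marks as correct. The binning scheme, the pair names, and the gold-standard labels are all hypothetical; the actual calibration presumably fits a smooth curve rather than using coarse bins.

```python
from collections import defaultdict

# Hypothetical raw scores (here: lower is better) and gold-standard labels.
scored_pairs = [
    (0.2, ("a", "b")), (0.8, ("a", "c")), (1.5, ("b", "c")),
    (1.7, ("c", "d")), (2.4, ("d", "e")), (2.9, ("x", "y")),
]
gold = {
    ("a", "b"): True, ("a", "c"): True, ("b", "c"): True,
    ("c", "d"): False, ("d", "e"): False,
}

def calibrate(scored_pairs, gold, bin_width):
    """Return {bin_start: fraction of benchmarkable pairs that are correct}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for raw, pair in scored_pairs:
        if pair not in gold:
            continue  # pair is not covered by the gold standard
        bin_start = int(raw // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += gold[pair]
    return {b: hits[b] / totals[b] for b in sorted(totals)}

calibration = calibrate(scored_pairs, gold, bin_width=1)
```

Once every evidence type has been mapped onto the same probability scale in this way, scores from different methods become comparable.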

curated knowledge

many sources

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Reactome

PID (NCI-Nature Pathway Interaction Database)

STKE (Signal Transduction Knowledge Environment)

MIPS (Munich Information center for Protein Sequences)

Gene Ontology

different gene identifiers

synonyms list

literature mining

MEDLINE

SGD (Saccharomyces Genome Database)

The Interactive Fly

OMIM (Online Mendelian Inheritance in Man)

co-mentioning
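
A minimal sketch of co-mentioning, assuming simple whitespace tokenization and exact name matching: count how many abstracts mention both members of each name pair. The three "abstracts" are toy sentences written for this example, not MEDLINE records.

```python
from collections import Counter
from itertools import combinations

# Toy abstracts and name list, invented for illustration.
abstracts = [
    "CYC1 and CYC7 expression is controlled by HAP1.",
    "HAP1 regulates CYC1.",
    "GAL4 activates transcription.",
]
names = ["CYC1", "CYC7", "HAP1", "GAL4"]

def comention_counts(abstracts, names):
    """Count abstracts in which each pair of names occurs together."""
    counts = Counter()
    for text in abstracts:
        # Crude tokenization: strip commas and periods, split on whitespace.
        tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
        found = sorted(n for n in names if n.lower() in tokens)
        counts.update(combinations(found, 2))
    return counts

counts = comention_counts(abstracts, names)
```

Real gene names are ambiguous and have many synonyms, so exact matching like this would both miss and overcall mentions; it only shows the counting idea.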

NLP (Natural Language Processing)

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]


primary experimental data

gene expression

GEO (Gene Expression Omnibus)

expression compendia

protein interactions

BIND (Biomolecular Interaction Network Database)

BioGRID (General Repository for Interaction Datasets)

DIP (Database of Interacting Proteins)

IntAct

MINT (Molecular Interactions Database)

HPRD (Human Protein Reference Database)

many sources

different gene identifiers

redundancy

not comparable

merge data by publication

raw quality scores
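
One way to read "merge data by publication" is sketched here: map every identifier to a canonical one through a synonyms list, then collapse records that report the same protein pair from the same publication, so an experiment imported by several databases is counted once. The identifiers, source names, and publication keys are placeholders invented for this example.

```python
# Hypothetical synonyms list: any alias → one canonical identifier.
synonyms = {
    "prot1": "P1", "P1": "P1", "alias1": "P1",
    "prot2": "P2", "P2": "P2",
}

# Hypothetical records: (source database, id_a, id_b, publication key).
records = [
    ("db_one", "prot1", "prot2", "pub_0001"),
    ("db_two", "P2", "alias1", "pub_0001"),   # same experiment, other identifiers
    ("db_two", "P1", "P2", "pub_0002"),       # independent publication
    ("db_one", "prot1", "unknown", "pub_0003"),
]

def merge_by_publication(records, synonyms):
    """Collapse records describing the same pair from the same publication."""
    merged = {}
    for source, id_a, id_b, publication in records:
        a, b = synonyms.get(id_a), synonyms.get(id_b)
        if a is None or b is None:
            continue  # identifier could not be resolved
        key = (tuple(sorted((a, b))), publication)
        merged.setdefault(key, set()).add(source)
    return merged

merged = merge_by_publication(records, synonyms)
```

After merging, the number of distinct publications supporting a pair can feed into a raw quality score without double counting.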


combine all evidence

spread over many species

transfer by orthology

naïve Bayesian scoring
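
Under the naive assumption that evidence types are independent, calibrated probabilities can be combined as one minus the product of the individual failure probabilities. The function below is a simplified sketch of that idea; a production scoring scheme may also correct for a prior, which is not modeled here.

```python
def combine_scores(probabilities):
    """Combine independent evidence: 1 - prod(1 - p_i)."""
    not_supported = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each score must be a probability")
        not_supported *= 1.0 - p
    return 1.0 - not_supported

# Two weak, independent lines of evidence reinforce each other.
combined = combine_scores([0.5, 0.5])
```

The combined score is never lower than the strongest single piece of evidence, and adding further independent evidence can only raise it.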

prediction of interactions

NetworKIN

the idea

phosphoproteomics

mass spectrometry

phosphorylation sites

Phospho.ELM

in vivo

kinases are unknown

computational methods

NetPhosK

Scansite

sequence motifs

kinase families

overprediction

no context

what a kinase could do

not what it actually does

context

co-activators

scaffolders

protein networks

the algorithm

NetworKIN
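
The slides suggest a two-step idea: sequence motifs narrow a phosphorylation site down to a kinase family, and network context picks the most plausible member of that family. The sketch below illustrates this reading only; the family table, the network scores, and the tie-breaking rule are hypothetical and are not taken from the actual NetworKIN implementation.

```python
# Hypothetical inputs; none of these names or numbers come from the talk.
family_members = {"familyX": ["kinase1", "kinase2"], "familyY": ["kinase3"]}
network_score = {
    frozenset(("kinase1", "substrateS")): 0.2,
    frozenset(("kinase2", "substrateS")): 0.9,
}

def assign_kinase(substrate, family, family_members, network_score):
    """Pick the family member best connected to the substrate, or None."""
    best_kinase, best_score = None, 0.0
    for kinase in family_members.get(family, []):
        # Assumption: a missing edge means no contextual support at all.
        score = network_score.get(frozenset((kinase, substrate)), 0.0)
        if score > best_score:
            best_kinase, best_score = kinase, score
    return best_kinase

chosen = assign_kinase("substrateS", "familyX", family_members, network_score)
```

The motif alone cannot separate kinase1 from kinase2, since both belong to the same family; the network score is what singles one out.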

benchmarking

Phospho.ELM

2.5-fold better accuracy

context is crucial

global statistics

visualization

ATM signaling

experimental validation

summary

reanalysis

benchmarking

integration

complementary data types

computational methods

reproduce what is known

biological discoveries

testable hypotheses

Acknowledgments

The STRING database

– Christian von Mering

– Michael Kuhn

– Berend Snel

– Martijn Huynen

– Sean Hooper

– Samuel Chaffron

– Julien Lagarde

– Mathilde Foglierini

– Peer Bork

Literature mining

– Jasmin Saric

– Rossitza Ouzounova

– Isabel Rojas

The NetworKIN method

– Rune Linding

– Gerard Ostheimer

– Francesca Diella

– Karen Colwill

– Jing Jin

– Pavel Metalnikov

– Vivian Nguyen

– Adrian Pasculescu

– Jin Gyoon Park

– Leona D. Samson

– Rob Russell

– Peer Bork

– Michael Yaffe

– Tony Pawson
