Cytoscape An open-source software platform for the...

Benno Schwikowski

Cytoscape: An open-source software platform for the exploration of molecular interaction networks

Benno Schwikowski

Systems Biology Group – UP Biologie Systémique, Institut Pasteur, Paris

Overview

1. Molecular interaction networks
2. The Cytoscape platform
3. Active modules

Biology poses difficulties for approaches from physics and engineering

Courtesy L. Hood

• What are the effects of mutations on the system?
• How do multiple mutations modify interactions with the environment?
• How are molecular processes controlled?

Biological and engineering models

Lazebnik, Cancer Cell, Sep. 2002

Molecular interaction networks

Cartoons may not be enough

… and drawn cartoons

“Taken together, these data suggest that EBF & E2A may be less important than Pax-5 for regulating low-level mb-1 transcription at later stages of development.”

Mol. Cell Biol. Dec’02, p.8850

These stories might be

• too qualitative
• too hard to integrate

Cartoons are hard to combine

• The stories are designed around specific questions

• They are not written against a single conceptual scaffold

• We may not be able to integrate cartoons into coherent models

Courtesy H. Bolouri

Uses for the study of biological networks and systems

• Better characterize the function of single genes
• Help structure, represent and interpret experimental data on interactions and states
  – Integrate different types of experimental data
  – Relate mechanisms to states
• Build a detailed understanding of cellular processes
  – Allow prediction of cellular state observables at different levels of detail
  – Allow intervention with predictable & measurable outcomes
• Guide experiments by providing testable hypotheses
• Compare processes within and across organisms

The source of the network parts list

DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems

[Figure: DNA sequencer and DNA sequences]

Sources of network state information

DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems

[Figure: DNA microarray]

Sources of network state information

DNA, mRNA, Proteins, Pathways, Networks, Cells, Tissues, Organs, Individuals, Populations, Ecosystems

[Figure: mass spectrometer and mass spectrum]

Sources of network interaction information: The Two-Hybrid System

[Figure labels: Bait Protein, Binding Domain, Prey Protein, Activation Domain, Reporter Gene]

• Two hybrid proteins are generated with transcription factor domains
• Both fusions are expressed in a yeast cell that carries a reporter gene whose expression is under the control of binding sites for the DNA-binding domain

Sources of network interaction information: The Two-Hybrid System

[Figure labels: Bait Protein, Binding Domain, Prey Protein, Activation Domain, Reporter Gene]

• Interaction of bait and prey proteins localizes the activation domain to the reporter gene, thus activating transcription.
• Since the reporter gene typically codes for a survival factor, yeast colonies will grow only when an interaction occurs.

Sources of network information: ChIP-chip

From Richard Young’s website: http://web.wi.mit.edu/young/location/

Chromatin ImmunoPrecipitation-Chip (ChIP-chip) analysis (Ren et al., Science, 2000)

Metabolic networks are fairly detailed

• Stoichiometric matrix – network topology with stoichiometry of biochemical reactions

Constraint      Condition              Solution space
Mass balance    $S \cdot v = 0$        Subspace of $\mathbb{R}^n$
Thermodynamic   $v_i > 0$              Convex cone
Capacity        $v_i < v_{\max}$       Bounded convex cone

Example reaction (Glucokinase): Glucose + ATP → Glucose-6-Phosphate + ADP

Glucokinase column of the stoichiometric matrix:

Glucose   -1
ATP       -1
G-6-P     +1
ADP       +1
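The constraints on this slide can be illustrated with a small sketch. The glucokinase column uses the coefficients from the slide; the three-reaction chain, the flux values, and the function names are illustrative assumptions, not part of the talk.

```python
# Toy stoichiometric matrix: rows are metabolites, columns are reactions.
# The chain uptake -> convert -> export is a made-up example (assumption).
S = [
    [1, -1, 0],   # metabolite A: produced by uptake, consumed by convert
    [0, 1, -1],   # metabolite B: produced by convert, consumed by export
]


def reaction_column(stoichiometry, metabolites):
    """Build one column of S from a {metabolite: coefficient} mapping."""
    return [stoichiometry.get(m, 0) for m in metabolites]


def mat_vec(matrix, vector):
    """Plain matrix-vector product S·v."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def is_feasible(matrix, fluxes, v_max, tol=1e-9):
    """Check mass balance (S·v = 0), v_i > 0, and v_i < v_max."""
    balanced = all(abs(x) <= tol for x in mat_vec(matrix, fluxes))
    positive = all(v > 0 for v in fluxes)
    bounded = all(v < v_max for v in fluxes)
    return balanced and positive and bounded


# Glucokinase column, with the coefficients shown on the slide.
glucokinase = reaction_column(
    {"Glucose": -1, "ATP": -1, "G-6-P": 1, "ADP": 1},
    ["Glucose", "ATP", "G-6-P", "ADP"],
)

steady = is_feasible(S, [2.0, 2.0, 2.0], v_max=10.0)      # balanced fluxes
unbalanced = is_feasible(S, [2.0, 1.0, 1.0], v_max=10.0)  # A accumulates
```

Here `steady` holds because every metabolite is produced as fast as it is consumed, while `unbalanced` fails the mass-balance condition.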

“Comparative assessment of large-scale data sets of protein-protein interactions.” Von Mering, C. Nature 2002

“Among the [protein-protein] interactions proposed by high-throughput methods will be many false positives. In fact, we estimate that more than half of all current high-throughput data are spurious.”

Gene regulation can be complex

Yuh, Bolouri, Davidson, Science, 1998

High- and low-level modeling may be combined

Ideker and Lauffenburger, Trends in Biotechnology 2003

High- and low-level modeling

Ideker and Lauffenburger, Trends in Biotechnology 2003

Cytoscape: Network Visualization and Analysis

Courtesy M. Smoot

http://cytoscape.org

Cytoscape Overview

Rich network visualizations

Powerful data mapping

Handles large networks

Supports many standards

Large community

Free (open-source)!

Network Data Import

SIF (Simple Interaction Format)

GML (Graph Markup Language)

XGMML (eXtensible Graph Markup and Modeling Language)

BioPAX (Biological Pathway Data)

PSI-MI 1 & 2.5 (Proteomics Standards Initiative)

SBML Level 2 (Systems Biology Markup Language)

Formatted Text and Excel Files

Network Attribute Management

Data Integration

1. Network Data

YDR382W pp YDL130W
YDR382W pp YFL039C
YFL039C pp YCL040W
YFL039C pp YHR179W

2. Attribute Data

ExpressionValue
YCL040W = 0.542
YDL130W = -0.123
YDR382W = -0.058
YFL039C = 0.192
YHR179W = 0.078
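As a rough illustration of the data integration step, the sketch below reads network lines of the form `source interaction target` and attribute lines of the form `node = value`, then joins them per node. The function names and the in-memory layout are assumptions for illustration; this is not Cytoscape's own import code.

```python
SIF_TEXT = """\
YDR382W pp YDL130W
YDR382W pp YFL039C
YFL039C pp YCL040W
YFL039C pp YHR179W
"""

ATTR_TEXT = """\
ExpressionValue
YCL040W = 0.542
YDL130W = -0.123
YDR382W = -0.058
YFL039C = 0.192
YHR179W = 0.078
"""


def parse_sif(text):
    """Return (source, interaction, target) triples; shorter lines are skipped."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        source, interaction = fields[0], fields[1]
        for target in fields[2:]:
            edges.append((source, interaction, target))
    return edges


def parse_attributes(text):
    """First non-empty line is the attribute name, the rest are 'node = value'."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    name, values = lines[0], {}
    for line in lines[1:]:
        key, _, value = line.partition("=")
        values[key.strip()] = float(value)
    return name, values


def integrate(edges, values):
    """Attach each node's neighbors and attribute value in one dictionary."""
    nodes = {}
    for source, _, target in edges:
        for node in (source, target):
            nodes.setdefault(node, {"neighbors": set(), "value": values.get(node)})
        nodes[source]["neighbors"].add(target)
        nodes[target]["neighbors"].add(source)
    return nodes


edges = parse_sif(SIF_TEXT)
attr_name, attr_values = parse_attributes(ATTR_TEXT)
network = integrate(edges, attr_values)
```

After this step every node carries both its interaction partners and its expression value, which is the combined view that a visual mapping can then draw on.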

VizMapper

VizMapper

Map network state data onto visual attributes.

Attributes for nodes and edges.

Very Flexible.

Expression data mapped to node color
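A continuous mapping from expression value to node color can be sketched as a linear interpolation between anchor colors. The blue-white-red scheme, the default range of -1 to 1, and the function name are assumptions chosen for illustration; they are not taken from the slides or from Cytoscape's API.

```python
BLUE, WHITE, RED = (0, 0, 255), (255, 255, 255), (255, 0, 0)


def expression_to_color(value, low=-1.0, high=1.0):
    """Map a value in [low, high] to a hex color; requires low < 0 < high."""
    v = max(low, min(high, value))          # clamp out-of-range values
    if v < 0:
        start, end, t = BLUE, WHITE, (v - low) / (0.0 - low)
    else:
        start, end, t = WHITE, RED, v / high
    rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Example with two of the expression values from the attribute data.
colors = {
    "YCL040W": expression_to_color(0.542),
    "YDL130W": expression_to_color(-0.123),
}
```

Values below zero shade toward blue and values above zero shade toward red, so the sign and magnitude of a change are both visible at a glance.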

Layout Algorithms

Network Editor

Filters

Linkout

Nodes and Edges act as hyperlinks to external databases.

User configurable URLs.

Large Networks

19,462 Nodes

31,130 Edges

Only half of what's possible!

Other Features

Manual Layout manipulation tools
− align, scale, rotate

Manually override visual styles

Undo
− Can undo most modifications to graphs

Publication Quality Graphics
− Export PDF, SVG, PS

Cytoscape is Extensible

Cytoscape is open-source.

We provide a plug-in interface that allows anyone to write and distribute their own extensions to Cytoscape.

Plug-ins represent the primary analysis mechanism in Cytoscape.

Plug-ins are distributed from a central database and can be installed while Cytoscape is running.

Plugin Examples

BiNGO (Analysis of GO categories found in network)

GenePro (Protein-Protein interaction cluster visualization)

jActiveModules (Search for significant networks)

NetworkAnalyzer (Statistical analysis of networks)

Agilent Literature Search (Network creation)

CyGoose (Gaggle communication)

See http://cytoscape.org for many more

Running Cytoscape

Cytoscape is licensed under the LGPL and is therefore freely available to everyone.

Cytoscape is written in Java and therefore runs on Windows, Mac, and Linux.

Cytoscape can be run locally or using Webstart.

Cytoscape applications

Cytoscape facilitates:

− Network Visualization

− Network Analysis

− Data Integration

− A framework for new types of analysis

Cytoscape Consortium

UC San Diego (Trey Ideker)

Institute for Systems Biology (Leroy Hood/Ilya Shmulevich)

Memorial Sloan-Kettering Cancer Center (Chris Sander)

University of Toronto (Gary Bader)

Agilent Technologies (Annette Adler)

Unilever (Guy Warner)

UC San Francisco (Bruce Conklin)

Institut Pasteur (Benno Schwikowski)

NIGMS/NIH GM070743-01

Getting started with Cytoscape

Tutorials on Cytoscape.org

Nature Protocols paper

Systemsbiology.fr

Active Modules

Protein interaction networks

Protein-protein interactions in yeast

Questions

• Is there any correlation between protein interactions and other attributes of proteins?
• Is that correlation significant, i.e., would it not easily occur in random data?

Functionally related proteins occur as clusters of interacting proteins

Protein interactions contain information about cellular roles

Simple prediction algorithm for the cellular role of a protein

1) Rank known cellular roles among the interactors from most frequent to least frequent.

2) Take the first three (or fewer) roles as predictions.

Accuracy on 1,393 out of 2,039 proteins: 72% (6 out of 8)

…on 100 scrambled networks: 12% (1 out of 8).
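The two-step prediction rule above can be sketched in a few lines. The protein names, role annotations, and the alphabetical tie-breaking rule are hypothetical choices made for this example; the slides do not say how ties were resolved.

```python
from collections import Counter


def predict_roles(protein, interactions, known_roles, top_n=3):
    """Rank roles of the protein's interactors by frequency; return the top few."""
    neighbors = set()
    for a, b in interactions:
        if a == protein and b != protein:
            neighbors.add(b)
        elif b == protein and a != protein:
            neighbors.add(a)
    counts = Counter()
    for neighbor in sorted(neighbors):
        counts.update(known_roles.get(neighbor, ()))
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [role for role, _ in ranked[:top_n]]


# Hypothetical toy data: P is the uncharacterized protein.
interactions = [("P", "Q1"), ("P", "Q2"), ("Q3", "P"), ("Q1", "Q2")]
known_roles = {
    "Q1": ["RNA splicing"],
    "Q2": ["RNA splicing", "transport"],
    "Q3": ["RNA splicing", "cell cycle"],
}
prediction = predict_roles("P", interactions, known_roles)
```

Running the same function on networks with shuffled edges would give the kind of scrambled-network baseline quoted on the slide.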

Protein interactions provide context information

RNA splicing

Mayer & Hieter, Nature Biotechnology 2000

Modular structure of cellular networks

Hartwell et al., Nature 1999

The cell as an information processor

Hartwell et al., Nature 1999

Advantage of modules

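• Theoretical: There are $2^{2^n}$ different Boolean functions on n variables (a truth table on n variables has $2^n$ rows, and each row can independently map to 0 or 1)
• Practical implication: There are fewer components and fewer experiments to perform

x1  x2  x3  f
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   1
1   0   0   0
1   0   1   1
1   1   0   0
1   1   1   1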
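To make the counting argument concrete, the sketch below counts truth-table rows and Boolean functions for small n and enumerates all functions for n = 2. The closed form given for the example function f is inferred here from its output column `00010101` and is not stated on the slide.

```python
from itertools import product


def truth_table_rows(n):
    """Number of input combinations for n Boolean variables."""
    return 2 ** n


def count_boolean_functions(n):
    """Each of the 2**n rows can independently output 0 or 1."""
    return 2 ** (2 ** n)


def enumerate_functions(n):
    """Yield every Boolean function on n variables as an input -> output dict."""
    inputs = list(product((0, 1), repeat=n))
    for outputs in product((0, 1), repeat=len(inputs)):
        yield dict(zip(inputs, outputs))


# The example function from the truth table: f = x3 AND (x1 OR x2) (inferred).
f_example = {x: int(x[2] and (x[0] or x[1])) for x in product((0, 1), repeat=3)}
f_column = "".join(str(f_example[x]) for x in product((0, 1), repeat=3))

n_functions_3 = count_boolean_functions(3)
```

With three inputs there are already 256 candidate functions, which is why restricting attention to small modules cuts down the number of experiments needed to tell them apart.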

Molecular Interaction Network

The “system” notion

Approach

1. Use interaction data: The system components have to interact with each other

2. Use state data: System components have to change synchronously

Conditions ->  gal1D  gal2D  gal3D  gal4D  gal5D  gal6D  gal7D  ga
COX6           0.034  0.052  0.152  0.111  0.198  0.097  0.171
NDT80          0.09   0      0.041  0.007  0.157  0.035  0.037
PRS1           0.167  0.063  0.23   0.233  0.003  0.234  0.25
UPF3           0.245  0.415  0.253  0.471  0.115  0.111  0.061
OPI1           0.174  0.045  0.046  0.015  0.098  0.001  0.029
YGR145W        0.387  0      0.036  0.577  0.151  0.255  0.101
YGL041C        0.285  0.232  0.126  0.086  0.096  0.002  0.21
CRM1           0.018  0.009  0.07   0.001  0.052  0.028  0.017
HIS3           0.432  0.568  0.339  0.71   0.188  0.07   0.619
CIT2           0.085  0.272  0.038  0.392  0.168  0.077  0.416
KHS1           0.159  0.168  0.149  0.139  0.293  0.023  0.043
YBR026C        0.276  0.072  0.324  0.189  0.014  0.142  0.243
YMR244W        0.078  0      0.077  0.239  0.077  0.254  0.126
YMR317W        0.181  0.324  0.065  0.086  0.288  0.122  0.233
YAR047C        0.234  0.121  0.019  0.109  0.107  0.05   0.156
DAL7           0.289  0.168  0.09   0.161  0.017  0.041  0.091
YDL177C        0.002  0.295  0.041  0.367  0.183  0.205  0.085
YLR338W        0.216  0.091  0.051  0.096  0.07   0.044  0.082
YGR073C        0.125  0.394  0.056  0.126  0.218  0.088  0.122
YGR146C        0.189  0.308  0.345  0.067  0.432  0.014  0.116

Approach – Summary

1. Interaction network between genes/proteins

2. Differential gene/protein abundances/activities

[Figure: matrix of genes (rows) by experiments (columns), e.g. COX6, NDT80, PRS1, … under conditions gal1D to gal7D]

Comparison to clustering

1. Connectivity by scaffold of protein interactions

Direct causal explanations and testable hypotheses

2. Significant change observed under certain experimental conditions

Module need not be active under all experimental conditions

Galactose induction pathway

Ideker et al. Science 292: 929 (2001)

What are the underlying regulatory interactions responsible for the observed changes in gene expression?

• Prot.–prot. interactions: BIND, ~6,300 proteins, 55,785 interactions in yeast
• RNA-expression data: 20 perturbations of the galactose utilization pathway
• Prot.→DNA interactions: Transfac/ChIP data, ~10,000 interactions for yeast
• Protein expression data: abundances, modifications, translation states
• Small mol. interactions: metabolites, drugs, and hormones: KEGG, enzymes, etc.
• Metabolic profiles: abundances may soon be avail. on a global scale

INTEGRATED MOLECULAR INTERACTION NETWORK

This technique is extensible to a variety of data types.

The galactose pathway in our network representation

[Figure legend: edge types protein→DNA and protein–protein; node color shows expression change (log10) on a scale from -3 through 0 to +3]

Ideker et al. Science 292: 929 (2001)

We consider only the significance of change, not its direction.

Module – A mathematical definition

A scoring system for regulatory “activity”

• Assign significance to each gene expression change† and express as a z-score
• The z-score of an entire subnetwork is the normalized sum of scores of its nodes

† Ideker, Thorsson, Siegel, and Hood, J. Comp. Bio. 7: 805 (2000)

[Figure: matrix of z-scores for genes A, B, C, D (columns) under perturbations/conditions 1 to 4 (rows)]

[Figure: conversion between p-value and z-score: p(1.0) = 0.159, z(0.159) = 1.0]

Combining z-scores under one condition

Genes A, B, C, D with z-scores 1, 2, −2, 1:

$$\frac{1 + 2 - 2 + 1}{\sqrt{4}} = 1$$
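The p-value to z-score conversion and the normalized sum can be sketched with the Python standard library. The sketch assumes that "normalized" means dividing the sum by the square root of the number of nodes, and that p-values are upper-tail probabilities of the standard normal distribution; the four example scores are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_from_z(z):
    """Upper-tail probability of a standard normal z-score."""
    return 1.0 - STD_NORMAL.cdf(z)


def z_from_p(p):
    """Inverse of p_from_z: the z-score whose upper tail has probability p."""
    return STD_NORMAL.inv_cdf(1.0 - p)


def aggregate_z(z_scores):
    """Normalized sum: stays standard normal if the inputs are independent N(0, 1)."""
    return sum(z_scores) / sqrt(len(z_scores))


p_at_one = p_from_z(1.0)                           # close to 0.159
subnetwork_z = aggregate_z([1.0, 2.0, -2.0, 1.0])  # (1 + 2 - 2 + 1) / sqrt(4)
```

Dividing by the square root of the subnetwork size keeps scores of subnetworks with different numbers of nodes on a comparable scale.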

Scoring over multiple perturbations/conditions

[Figure: the subnetwork's scores across perturbations/conditions, ordered as A(1), A(2), A(3), A(4)]

Scoring over multiple conditions: Rank adjustment

• What is the probability that, out of m z-scores, the first j ones are larger than A(j)?

• Idea: Compute the probability that j or more z-scores are larger than A(j).
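One way to compute the probability described above is a binomial tail. Assuming the m z-scores are independent and standard normal under the null hypothesis, a single score exceeds A(j) with probability P = 1 − Φ(A(j)), and the probability that j or more exceed it is the sum over h = j … m of C(m, h) · P^h · (1 − P)^(m − h). Whether this matches the formula on the original slide exactly is an assumption; the function name and example scores are illustrative.

```python
from math import comb
from statistics import NormalDist


def rank_adjusted_p(sorted_scores, j):
    """P(at least j of m independent N(0, 1) scores exceed the j-th largest score).

    sorted_scores must be in descending order; j is 1-based.
    """
    m = len(sorted_scores)
    if not 1 <= j <= m:
        raise ValueError("j must be between 1 and the number of scores")
    p_single = 1.0 - NormalDist().cdf(sorted_scores[j - 1])
    return sum(
        comb(m, h) * p_single ** h * (1.0 - p_single) ** (m - h)
        for h in range(j, m + 1)
    )


# Hypothetical subnetwork scores under four conditions, largest first.
scores = [3.0, 2.0, 0.5, -1.0]
adjusted = [rank_adjusted_p(scores, j) for j in range(1, len(scores) + 1)]
best_j = min(range(1, len(scores) + 1), key=lambda j: adjusted[j - 1])
```

Picking the j with the smallest adjusted probability selects the subset of conditions under which the subnetwork looks most significant, which fits the idea that a module need not be active under all conditions.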

Scoring over multiple perturbations/conditions

[Figure: scores across perturbations/conditions combined into a final score]

Different overlapping condition sets

Each subnetwork is active for a subset of conditions

Running the algorithm again on the high-scoring, 340-gene subnetwork reveals further structure

Each condition may appear several times, or not at all, depending on (a) how significant it is and (b) how well it is explained by the interaction network.

Pathways in Rosetta’s compendium (300 conditions)

THANK YOU FOR YOUR ATTENTION

Finding good modules

Finding good modules in a large network is hard

• Once specified, we can easily score a particular pathway.
• But how to identify the highest-scoring pathways in a full molecular interaction network of thousands of nodes and interactions?
• This problem is NP-hard.
• We use a customized version of a general-purpose algorithm to detect high-scoring pathways from the data.

Use a method based on simulated annealing.
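A minimal sketch of a simulated-annealing search is shown below. It toggles one node at a time and scores a state by its best connected component, using the normalized sum of z-scores. The toy graph, the z-scores, the cooling schedule, and all function names are assumptions for illustration; this is not the customized algorithm used in the talk.

```python
import math
import random


def components(active, adjacency):
    """Connected components of the subgraph induced by the active nodes."""
    seen, result = set(), []
    for start in active:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nb in adjacency[node]:
                if nb in active and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        result.append(comp)
    return result


def best_component(active, adjacency, z):
    """Highest-scoring connected component and its normalized z-score sum."""
    best, best_score = set(), float("-inf")
    for comp in components(active, adjacency):
        score = sum(z[n] for n in comp) / math.sqrt(len(comp))
        if score > best_score:
            best, best_score = comp, score
    return best, best_score


def anneal(adjacency, z, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    active = {n for n in nodes if rng.random() < 0.5}
    current = best_component(active, adjacency, z)[1]
    best_active, best = set(active), current
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        node = rng.choice(nodes)
        active ^= {node}  # toggle one node on or off
        candidate = best_component(active, adjacency, z)[1]
        if candidate >= current or rng.random() < math.exp(
            (candidate - current) / temperature
        ):
            current = candidate
            if current > best:
                best_active, best = set(active), current
        else:
            active ^= {node}  # rejected: undo the toggle
    return best_component(best_active, adjacency, z)


# Hypothetical path graph A-B-C-D-E-F with made-up z-scores.
adjacency = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
z = {"A": 2.5, "B": 2.0, "C": -0.2, "D": 3.0, "E": -2.0, "F": 0.1}
module, module_score = anneal(adjacency, z)
```

Accepting some score-decreasing moves early on lets the search include a weakly changing node when it bridges two strongly changing neighbors, which a purely greedy search would tend to miss.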

Computational complexity

• 6,000 genes form up to $2^{6000}$ possible gene sets
• 300 conditions have $2^{300}$ subsets ⇒ $2^{180000} > 10^{50000}$ combinations to search
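The orders of magnitude quoted on this slide can be checked with exact integer arithmetic. The sketch only evaluates the numbers as they appear on the slide and makes no claim about how the combined exponent was derived; the helper name is an assumption.

```python
import math


def log10_of_power_of_two(exponent):
    """Base-10 logarithm of 2**exponent, i.e. roughly its number of digits."""
    return exponent * math.log10(2)


gene_set_digits = log10_of_power_of_two(6000)    # size of 2**6000 in decimal
condition_digits = log10_of_power_of_two(300)    # size of 2**300 in decimal

# Python integers are arbitrary precision, so this comparison is exact.
inequality_holds = 2 ** 180000 > 10 ** 50000
```

Even the smallest of these numbers has about 90 decimal digits, so exhaustive enumeration is out of reach at any of the quoted scales.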

• Finding the highest-scoring gene set is NP-hard, even for a single condition

NP-hardness

• NP-hardness is a property of computational problems
• It implies that the problem is at least as hard as thousands of other well-known problems: an efficient algorithm for it would solve all of them efficiently
• Efficient algorithms for NP-hard problems are unknown (and probably don’t exist)
• Thus, need to look for approximation or heuristic algorithms

20 GAL conditions vs. the entire interaction network

Several subnetworks emerge

Detail of subnetwork 1b: Galactose metabolism

Our method is only concerned with the significance of change, not its direction.

Gal4 doesn’t show dramatic expression change, but it is included because it connects and explains the other genes’ differential expression.

Galactose induction pathway

Ideker et al. Science 292: 929 (2001)

SUMMARY

• Method for explaining gene expression profiles with molecular interactions found in the public databases.

• Results in testable hypotheses for the signaling and regulatory pathways behind observed gene expression changes.

Features of this approach

• Tries to define clusters of genes that show similar concerted reactions to perturbations

• Incorporates many data types
• Robust against noise, false positive interactions
• Many experiment-specific networks identified
• Interpretive framework offers testable hypotheses

The “system” notion
