Graphs in R and Bioconductor


Statistical Analysis of Microarray Expression Data with R and Bioconductor

Copenhagen DK, November 2007

Denise Scholtens, Ph.D.

Assistant Professor, Department of Preventive Medicine, Northwestern University Medical School, Chicago IL USA


Graphs

Sets of nodes and edges
Nodes: objects of interest
Edges: relationships between them

A useful abstraction to talk about relationships, interactions, etc.

Graphs

Nodes
  Node types
  Conditions on interactions between nodes

Edges
  Edge types
  Direction
  Weights

Graphs are very flexible, but can quickly get complicated!

Graphs

Knowledge representation
  Data structure, visualization
Exploratory data analysis (EDA)
  Graph traversal and analysis
Inference
  Adopting a statistical paradigm for making conclusions about data recorded in graphs

Graphs

Inference
  Ex. Testing association between two graphs
  Ex. Identifying local features of large global graphs

Statistical approaches
  FP – edges that were tested, were found, but are not there in nature
  FN – edges that were tested, were not found, but are there in nature
  Untested – edges that were never tested, may or may not exist in nature, but we don't know

Example: undirected graph

Example: directed graph

Elementary computations on IMCA pathway

> library("graph")
> data("integrinMediatedCellAdhesion")
> class(IMCAGraph)
> s = acc(IMCAGraph, "SOS")
Ha-Ras  Raf  MEK  ERK  MYLK  MYO  F-actin  cell proliferation
     1    2    3    4     5    6        7                   5

Example: Directed Acyclic Graph (DAG)

Gene Ontology (GO)
  A structured vocabulary to describe molecular function of gene products, biological processes, and cellular components.
  A set of "is a", "is part of", and "has a" relationships between these terms

GO graphs

Example: Bipartite graph

Two distinct sets of nodes U and V
Edges exist between elements of U and V
Edges cannot connect nodes in U to other nodes in U, and similarly for V
E.g. literature co-citation graphs

Gene-Literature graphs

DKC1

An adjacency matrix $A_G$ ($n \times m$) is often used to represent a bipartite graph G with node sets U, V

One mode graphs

$A_U = A_G^{T} A_G$ – co-citation of genes in literature

$A_V = A_G A_G^{T}$ – literature containing common genes

(Boolean algebra)
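The one-mode transformation can be sketched in base R. This is a minimal illustration on made-up data, not code from the slides: the matrix `AG`, its row and column names, and the helper `bool_prod` are all hypothetical. It assumes rows index papers and columns index genes, which is the orientation that makes $A_G^{T} A_G$ a gene-by-gene matrix.

```r
# Toy bipartite incidence matrix (hypothetical data).
# Assumption: rows are papers, columns are genes.
AG <- matrix(c(1, 1, 0,
               0, 1, 1,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("paper1", "paper2", "paper3"),
                             c("geneA", "geneB", "geneC")))

# Boolean product: an entry is TRUE when at least one shared neighbour exists.
bool_prod <- function(x, y) (x %*% y) > 0

AU <- bool_prod(t(AG), AG)   # gene-by-gene: cited together in some paper
AV <- bool_prod(AG, t(AG))   # paper-by-paper: share at least one gene

diag(AU) <- FALSE            # drop self-loops for readability
diag(AV) <- FALSE
```

Using the ordinary matrix product without the `> 0` step would instead count the number of shared papers or genes, which can serve as an edge weight.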

Bipartite graph transformation

Data structure

Flexible way to record and visualize data
Undirected/directed edges
Node colors
Structures (DAG, bipartite graph)
Multiple edges
Weighted or labeled edges

Note: user has responsibility to recognize and make use of the graph structure!

Directed, undirected graphs
Adjacent nodes
Accessible nodes
Self-loop
Node degree
Walk: alternating sequence of nodes and incident edges
Closed walk
Distance between nodes, shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Connected graph
Weakly connected directed graph
Strongly connected directed graph

Graphs: vocabulary

graph: basic class definitions and functionality

Rgraphviz: rendering functionality. Different layout algorithms. Node plotting, line type, color, etc. can be controlled by the user.

RBGL: interface to graph algorithms

graph, Rgraphviz, RBGL

graph package

Classes
  graph: a general class that all other classes should extend
  graphNEL: node/edge-list representation. Can specify direction of edges, edge weights, etc.
  distGraph: graph based on distances between nodes
  clusterGraph: series of completely connected subgraphs (cliques) with no edges between them

> library("graph"); library(Rgraphviz)

> myNodes = c("s", "p", "q", "r")

> myEdges = list(s = list(edges = c("p", "q")),
                 p = list(edges = c("p", "q")),
                 q = list(edges = c("p", "r")),
                 r = list(edges = c("s")))

> g = new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")

> plot(g)

Creating a graph

> nodes(g)
[1] "s" "p" "q" "r"

> edges(g)
$s
[1] "p" "q"

$p
[1] "p" "q"

$q
[1] "p" "r"

$r
[1] "s"

> degree(g)
$inDegree
s p q r
1 3 2 1

$outDegree
s p q r
2 2 2 1

Querying nodes, edges, degree

> g1 <- addNode("e", g)

> g2 <- removeNode("r", g)

> ## addEdge(from, to, graph, weights)

> g3 <- addEdge("e", "s", g1, pi/2)

> ## removeEdge(from, to, graph)

> g4 <- removeEdge("e", "s", g3)

> identical(g4, g1)

[1] TRUE

Graph manipulation

> adj(g, c("p", "q"))
$p
[1] "p" "q"

$q
[1] "p" "r"

> acc(g, c("p", "q"))
$p
s q r
3 1 2

$q
s p r
2 1 1

Adjacent and accessible nodes

Node-edge lists
Adjacency matrix (straightforward)
Adjacency matrix (sparse)
From-To matrix

They are equivalent, but may be hugely different in performance and convenience for different applications.

Can coerce between the representations

Graph representation

> ft
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4

> ftM2adjM(ft)
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1

> ftM2graphNEL(ft)
A graphNEL graph with directed edges
Number of Nodes = 4
Number of Edges = 4

Graph representations: from-to matrix
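To make the equivalence of the representations concrete, the conversion from a from-to matrix to a dense adjacency matrix can be sketched in base R. The helper `ft_to_adj` is hypothetical and written only for illustration; it is not part of the graph package, whose own converters are the `ftM2adjM` and `ftM2graphNEL` functions shown on the slide.

```r
# From-to matrix from the slide: one row per directed edge.
ft <- matrix(c(1, 2,
               2, 3,
               3, 1,
               4, 4), ncol = 2, byrow = TRUE)

# Hypothetical helper (not part of the graph package): build a dense
# adjacency matrix by indexing with the (from, to) pairs.
ft_to_adj <- function(ft) {
  ids <- sort(unique(as.vector(ft)))
  labels <- as.character(ids)
  adj <- matrix(0, length(ids), length(ids), dimnames = list(labels, labels))
  adj[cbind(match(ft[, 1], ids), match(ft[, 2], ids))] <- 1
  adj
}

adj <- ft_to_adj(ft)

# Going back: which() with arr.ind = TRUE recovers the edge list.
ft_back <- which(adj == 1, arr.ind = TRUE)
```

The dense matrix needs storage quadratic in the number of nodes, while the from-to matrix grows with the number of edges, which is one reason the choice of representation matters for large sparse graphs.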

Connected components
cc = connComp(rg)
table(listLen(cc))
 1  2  3  4 15 18
36  7  3  2  1  1

Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)

Depth first search
dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"

[Figure: rg]

RBGL: interface to Boost Graph Library

dfs(sg, "N14")
bfs(sg, "N14")

depth/breadth first search

connected components

sc = strongComp(g2)
nattrs = makeNodeAttrs(g2, fillcolor = "")
for (i in 1:length(sc)) nattrs$fillcolor[sc[[i]]] = myColors[i]
plot(g2, "dot", nodeAttrs = nattrs)

wc = connComp(g2)

Different algorithms for different types of graphs
o all edge weights the same
o positive edge weights
o real numbers

…and different settings of the problem
o single pair
o single source
o single destination
o all pairs

Functions
bfs
dijkstra.sp
sp.between
johnson.all.pairs.sp

Shortest path algorithms
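For the simplest case, where all edge weights are the same, breadth-first search already gives single-source shortest distances. The sketch below is a plain base R illustration under that assumption; `bfs_dist` and `g_edges` are hypothetical names, and the code does not call RBGL.

```r
# Hypothetical helper: breadth-first search distances on an unweighted graph
# stored as a named list of neighbour vectors (same shape as edges(g)).
bfs_dist <- function(adj_list, source) {
  dist <- setNames(rep(NA_integer_, length(adj_list)), names(adj_list))
  dist[source] <- 0L
  queue <- source
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (w in adj_list[[v]]) {
      if (is.na(dist[w])) {          # first visit gives the shortest distance
        dist[w] <- dist[v] + 1L
        queue <- c(queue, w)
      }
    }
  }
  dist
}

# The directed example graph from the "Creating a graph" slide.
g_edges <- list(s = c("p", "q"), p = c("p", "q"), q = c("p", "r"), r = "s")
d <- bfs_dist(g_edges, "p")
```

With positive, unequal weights the queue would have to become a priority queue (Dijkstra), and with negative weights a different algorithm is needed, which is why the package exposes several functions.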


set.seed(123)
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2, fromNode, toNode)

sp[[1]]$path
 [1] "N43" "N08" "N88"
 [4] "N73" "N50" "N89"
 [7] "N64" "N93" "N32"
[10] "N12" "N81"

sp[[1]]$length
[1] 10

Shortest path

ap = johnson.all.pairs.sp(rg2)
hist(ap)

Shortest path

mst = mstree.kruskal(gr)

[Figure: gr]

Minimal spanning tree

Consider graph g with single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.

> edgeConnectivity(g)
$connectivity
[1] 2

$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"

$minDisconSet[[2]]
[1] "D" "H"

Connectivity

dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.

neato: undirected graphs using 'spring' models

twopi: radial layout. One node ('root') chosen as the center. Remaining nodes on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. Root can be specified or chosen heuristically.

Rgraphviz: layout engines



Combining R graphics and Rgraphviz: custom node drawing functions

Inference questions

Compare two graphs: GraphAT

Identify local features in global graphs: apComplex

Large scale topological features of graphs: RBGL, measurement error effects

Estimating error probabilities: ppiStats

Compare two graphs
GraphAT

Compare two graphs

Observed graph of the literature protein-protein interactions used in Ge et al. (315 edges, 298 nodes)

Cluster graph for the 30 clusters reported in Ge et al. (156,205 edges, 2885 nodes)

(All genes that are not in the list of literature-reported PPIs have been removed from this graph for visualization purposes.)

Comparing two graphs

Nodes – yeast genes
Graph 1 – literature reported protein-protein interactions
Graph 2 – cell cycle gene expression cluster membership
Do the graphs overlap more than random?
Is there anything special about the overlapping edges?

Graph of the reported intracluster edges in Ge et al. (42 edges, 65 nodes)
This graph was derived by intersecting the observed literature graph with the cluster graph.

Graph resulting from random reassignment of 315 edges among 2885 nodes
Note that the structure of this graph is quite different from the observed literature graph.

Intersection of random edge graph with cluster graph

Intersecting Edges

Random Edge algorithm (RE)

Permuting Node Labels algorithm (PN)

Graph with the same test statistic as the observed graph of intracluster edges reported in Ge et al. – 42 intracluster edges
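A node-label permutation test on the number of intersecting edges can be sketched as follows. This is a toy base R illustration, not the GraphAT implementation: the two small graphs, the helper names `make_graph` and `n_shared`, and the number of permutations are all assumptions chosen for the example.

```r
set.seed(1)

# Hypothetical toy data: two undirected graphs on the same 6 nodes,
# stored as symmetric 0/1 adjacency matrices.
n <- 6
make_graph <- function(edge_pairs, n) {
  a <- matrix(0, n, n)
  a[edge_pairs] <- 1
  a[edge_pairs[, 2:1, drop = FALSE]] <- 1   # symmetrise
  a
}
g1 <- make_graph(rbind(c(1, 2), c(2, 3), c(4, 5)), n)
g2 <- make_graph(rbind(c(1, 2), c(2, 3), c(1, 3),
                       c(4, 5), c(5, 6), c(4, 6)), n)

# Test statistic: number of edges present in both graphs.
n_shared <- function(a, b) sum(a * b) / 2

observed <- n_shared(g1, g2)

# Permuting Node Labels: relabel the nodes of g1 at random, keeping its
# structure fixed, and recount the shared edges.
perm_stats <- replicate(2000, {
  p <- sample(n)
  n_shared(g1[p, p], g2)
})

# One-sided permutation p-value (with the usual +1 correction).
p_value <- (sum(perm_stats >= observed) + 1) / (length(perm_stats) + 1)
```

The Random Edge variant would instead redistribute the edges of `g1` uniformly over all node pairs, which does not preserve the structure of the observed graph.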

Other Test Statistics

Questions of Interest

Is it reasonable to condition on the structure of the observed graphs – something like an ancillary/sufficient statistic?

Why is the number of intersecting edges invariant to the node label permutation and random edge reassignment algorithms?

What are the most informative test statistics?

EDA Questions of Interest

Which expression clusters have intersections with which of the literature clusters?

Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?

Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine, or that literature clusters are not cell-cycle regulated?

Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?

Identify local features in global graphs

apComplex

Local Modeling of Global Interactome Data

AP-MS (Affinity Purification - Mass Spectrometry)

Measures Complex Comembership

Gavin, et al. (Nature, 2002 and Nature, 2006)
Ho, et al. (Nature, 2002)
Krogan, et al. (Mol Cell 2004, Nature 2006)

Y2H (Yeast Two Hybrid)

Measures Physical Interactions

Ito, et al. (PNAS, 1998)
Uetz, et al. (Nature, 2000)

AP-MS data:

Using a bait protein, AP-MS technology finds prey proteins that are comembers of at least one complex with the bait.

Y2H data:

Y2H technology finds pairs of physically interacting proteins.

[Figure: AP-MS data (one purification, bait and prey) alongside Y2H data]

We want to estimate the bipartite protein complex membership graph, A:

*Estimation of A requires estimation of K, the number of complexes.

1. Some proteins participate in more than one complex

2. In an AP-MS experiment, some proteins are used as baits and some proteins are only ever found as prey

3. Graph theoretic paradigm to allow for succinct formulation
• Bipartite graph for complex membership (A)
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins

4. Statistical paradigm to allow for false positive and false negative observations

Four unique aspects to the algorithm

PP2A

Heterotrimeric complex consisting of:

Tpd3 - regulatory A subunit

Rts1 or Cdc55 - regulatory B subunits

Pph21 or Pph22 - catalytic subunits

Jiang and Broach (1999). EMBO.

1. Some proteins participate in more than one complex

Gavin, et al. (2002)
Rgraphviz plot of yTAP C151

Bader & Hogue (2002)
Portion of Figure 2: Overlap of the spoke models of TAP and HMS-PCI.

Jansen, et al. (2003)
PIT Bayesian Network, LR>600

http://genecensus.org/intint

[Figure: graph with nodes Tpd3, Pph21, Pph22, Cdc55, Myo5, Cdc10, Cdc11]

1. Some proteins participate in more than one complex


The apComplex algorithm detects:

Zds1 and Zds2 (known cell-cycle regulators) only exist in complexes with the Cdc55-Pph22 trimer!

2. Graph theoretic paradigm to allow for succinct expression of constructs involved

• Bipartite graph for complex membership
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins


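[Diagram: $A \rightarrow Y = A \otimes A' \rightarrow Z$ (AP-MS data for bait proteins and hit-only proteins); an estimation algorithm maps $Z$ to $(\hat{A}, \hat{Y})$. Dimension labels M and N appear in the original figure.]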

In summary…

We start with an initial estimate for A, and then refine that estimate according to a two component probability measure:

$P(Z \mid A, \mu, \alpha) = L(Z \mid Y = A \otimes A', \mu, \alpha)\; C(Z \mid A, \mu, \alpha)$

where L is the usual likelihood and C is a regularization/penalty term (no. of complexes).
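The step from complex membership A to comembership Y can be illustrated in base R, assuming that $A \otimes A'$ denotes a Boolean matrix product (two proteins are comembers if they share at least one complex). The membership matrix below is a hypothetical encoding of the four PP2A trimers implied by the subunit list on the PP2A slide; the complex labels C1 to C4 are made up.

```r
# Hypothetical membership matrix A: rows are proteins, columns are complexes.
# The four columns are the PP2A trimers implied by the slide
# (Tpd3 + one B subunit + one catalytic subunit); illustration only.
proteins <- c("Tpd3", "Rts1", "Cdc55", "Pph21", "Pph22")
A <- matrix(0, nrow = 5, ncol = 4,
            dimnames = list(proteins, paste0("C", 1:4)))
A[c("Tpd3", "Rts1",  "Pph21"), "C1"] <- 1
A[c("Tpd3", "Rts1",  "Pph22"), "C2"] <- 1
A[c("Tpd3", "Cdc55", "Pph21"), "C3"] <- 1
A[c("Tpd3", "Cdc55", "Pph22"), "C4"] <- 1

# Boolean product: Y[i, j] = 1 if proteins i and j share at least one complex.
Y <- ((A %*% t(A)) > 0) * 1
```

In this toy encoding Tpd3 is a comember of every other protein, while the alternative B subunits (Rts1, Cdc55) and the alternative catalytic subunits (Pph21, Pph22) are never comembers of each other, which is the kind of structure a comembership assay can reveal.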

Large scale topological features of graphs

RBGL
measurement error

[Figure: AP-MS example graph on Apl5, Apl6, Apm3, Aps3, Ckb1, Eno2; legend: untested edges marked "?", tested edges absent]

6 AP-MS observations
Gavin et al. (2002)

Apl5: Apl5, Apl6, Apm3, Aps3, Ckb1
Apl6: Apl5, Apl6, Apm3, Eno2
Apm3: Apl6, Apm3

Measurement Error
FPs
FNs
Stochastic
Systematic

Missing Data
Untested Edges

Suitable Inference
What can we conclude?

Statistical Modeling

Network data are experimentally obtained and hence should be subjected to the same types of data analysis as other data

Given some model, what can we conclude?

Likelihood methods
Global and local feature estimation

Measurement Error

Stochastic
  FPs and FNs made 'randomly'

Systematic
  FPs and FNs made in some predictable way
  'Sticky' proteins may cause FPs
  Conformationally deformed proteins may cause FNs
  Errant treatment of untested edges as absent will cause systematic FNs

Random graphs vs regular networks and what's in between?

Small-world connectivity
Watts & Strogatz (1998)
6-degrees-of-separation

$L_{\text{small-world}} \approx L_{\text{random}}$, where L = average path length

$C_{\text{small-world}} \gg C_{\text{random}}$, where C = average clustering coefficient of nodes

If node n has $k_n$ neighbors, then

$C_n = \dfrac{\#\ \text{observed edges between neighbors}}{k_n (k_n - 1)/2}$
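The clustering coefficient is straightforward to compute from an adjacency matrix. The base R sketch below is illustrative only: `clustering_coef` is a hypothetical helper, the toy graph is made up, and nodes with fewer than two neighbors are given `NA` because the ratio is undefined for them (an assumption; other conventions set it to 0).

```r
# Hypothetical helper: local clustering coefficient C_n for every node of an
# undirected graph given as a symmetric 0/1 adjacency matrix without self-loops.
clustering_coef <- function(adj) {
  sapply(seq_len(nrow(adj)), function(n) {
    nb <- which(adj[n, ] == 1)
    k <- length(nb)
    if (k < 2) return(NA_real_)        # undefined for fewer than 2 neighbours
    observed <- sum(adj[nb, nb]) / 2   # edges among the neighbours
    observed / (k * (k - 1) / 2)
  })
}

# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to node 3.
adj <- matrix(0, 4, 4)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

cc <- clustering_coef(adj)
C_avg <- mean(cc, na.rm = TRUE)   # average clustering coefficient C
```

Node 3 shows how a single extra neighbor dilutes the coefficient: it gains two possible neighbor pairs but no observed edge between them.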

Small-world

Watts & Strogatz (Nature 1998)

Scale-free

Class of small world networks
Node degree distribution follows a power law
In scale-free graphs there are a few highly connected "hubs"
Biology: Relative robustness of network to random perturbation, but huge breakdown to targeted disruption

Simulation Study

Random, Scale-Free, Overlapping Cluster Graphs
50, 500, and 1000 nodes
Approx 5 edges/node
Stochastic FNs: 0.05, 0.15, 0.25
Stochastic FPs: PPV=0.50
Systematic FPs: 'sticky' baits detect neighbors of neighbors with p=0.50

L, stochastic FNs

L, stochastic FNs

In general, FNs increase L
In the overlapping cluster graph, FNs splinter graph into unconnected components
How to treat L for an unconnected graph?
Misleading results if treated naïvely

C, stochastic FPs (probability an observed edge is true = 0.5)

C, stochastic FPs

If a node has k neighbors, C is the fraction of the possible edges between those neighbors that exist

Both numerator and denominator are affected

For cluster graphs, nodes have more neighbors due to FPs, but the number of edges between the neighbors does not increase proportionately

Node degree distribution, systematic FPs

Looked at log-log plot of complementary cumulative distribution function (not frequency distribution)
Based on theoretical work by Li, et al. (2006), Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics.

Assess fit of straight line using $R^2$

Node degree distribution, systematic FPs

Take Home Message from Simulation Study

FPs and FNs do affect statistics on graphs
Even with small amounts of measurement error, the effects can lead to biological misinterpretations
Specifically, scale-free

Estimating error probabilities

ppiData

Systematic Error: Bias in Bait-Prey Systems

[Figure: two panels showing edges among the baits Apl5, Apl6, Apm3]

Doubly tested bait-bait edges may be
• reciprocated: tested twice, observed twice
• unreciprocated: tested twice, observed once

For a bait subject only to stochastic error, we expect the set of unreciprocated edges to consist of an approximately equal number of in- and out-edges.

If this is not the case, the bait is subject to systematic bias.

In-degree vs. out-degree of unreciprocated edges in bait-induced subgraphs (square root scale)

[Figure panels: Gavin, 2006 (AP-MS); Krogan, 2006 (AP-MS)]

Per protein 'coin-tossing' model

Quantify departure from symmetry using a binomial distribution with probability parameter p=0.50.

Previous pictures show nodes with p-value<0.01 in dark blue.

Removing these nodes has implications for overall estimates of stochastic FP and FN probabilities.
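The coin-tossing model amounts to a two-sided binomial test per bait. The sketch below uses base R's `binom.test` on made-up counts; the bait names and numbers are hypothetical, and only the 0.5 probability and the 0.01 threshold come from the slide.

```r
# Hypothetical counts of unreciprocated edges for three baits.
unrecip <- data.frame(
  bait  = c("baitA", "baitB", "baitC"),
  n_in  = c(4, 1, 20),
  n_out = c(5, 18, 2)
)

# Under stochastic error only, each unreciprocated edge is equally likely to
# be an in-edge or an out-edge: n_in ~ Binomial(n_in + n_out, 0.5).
unrecip$p_value <- mapply(
  function(i, o) binom.test(i, i + o, p = 0.5)$p.value,
  unrecip$n_in, unrecip$n_out
)

# Flag baits that depart from symmetry at the 0.01 level used on the slide.
unrecip$biased <- unrecip$p_value < 0.01
```

Because one test is run per bait, a real analysis over hundreds of baits would expect a few flags by chance alone at this threshold.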

Global estimation of pTP and pFP

Let $R_T$ and $U_T$ be random variables for the true positive observations. Then, given the true number of edges in the graph, $\Delta_T$,

$$\Pr(R_T = r_T, U_T = u_T \mid \Delta_T, p_{TP}) = \binom{\Delta_T}{r_T,\; u_T,\; \Delta_T - r_T - u_T}\, p_{TP}^{2 r_T}\, \bigl(2 p_{TP}(1 - p_{TP})\bigr)^{u_T}\, (1 - p_{TP})^{2(\Delta_T - r_T - u_T)}.$$

Similarly, let $R_F$ and $U_F$ be random variables for the false positive observations. Then, for the total number of potential edges in the graph, $N$,

$$\Pr(R_F = r_F, U_F = u_F \mid N, \Delta_T, p_{FP}) = \binom{N - \Delta_T}{r_F,\; u_F,\; N - \Delta_T - r_F - u_F}\, p_{FP}^{2 r_F}\, \bigl(2 p_{FP}(1 - p_{FP})\bigr)^{u_F}\, (1 - p_{FP})^{2(N - \Delta_T - r_F - u_F)}.$$

We observe $R = R_T + R_F$ and $U = U_T + U_F$.

So we deal with the convolution.

Estimating pTP and pFP

Using the expectation of R and U, we can derive two independent equations with three unknown parameters (pTP, pFP, and ΔT).

For the resulting method of moments estimators, any one of the parameters defines the other two, so we can easily derive a one-dimensional solution manifold (with variance bounds).

In practice for AP-MS data, we choose to estimate pTP using a gold standard set of complex co-membership relationships that exist under similar experimental conditions to those used for AP-MS studies.
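One way to see the one-dimensional solution family is to fix pTP and solve the two moment equations for the other two parameters. The sketch below is an assumption-laden illustration, not the ppiStats code: it assumes true edges are reciprocated with probability pTP^2 and unreciprocated with probability 2 pTP (1 - pTP), that the same holds for the N - ΔT non-edges with pFP, and it matches observed counts to $E[R] = \Delta_T p_{TP}^2 + (N - \Delta_T) p_{FP}^2$ and $E[U] = 2\Delta_T p_{TP}(1 - p_{TP}) + 2(N - \Delta_T) p_{FP}(1 - p_{FP})$. The helper `mom_given_ptp` and all numbers are hypothetical.

```r
# Hypothetical helper: method-of-moments estimates of delta_T and p_FP for a
# fixed p_TP, from observed reciprocated (r) and unreciprocated (u) edge
# counts among N potential edges. Assumes the moment equations
#   E[R] = delta_T * pTP^2 + (N - delta_T) * pFP^2
#   E[U] = 2 * delta_T * pTP * (1 - pTP) + 2 * (N - delta_T) * pFP * (1 - pFP)
mom_given_ptp <- function(r, u, N, pTP) {
  s <- r + u / 2   # equals delta_T * pTP + (N - delta_T) * pFP in expectation
  delta_T <- (r * N - s^2) / (r + N * pTP^2 - 2 * s * pTP)
  pFP <- (s - delta_T * pTP) / (N - delta_T)
  c(delta_T = delta_T, pFP = pFP)
}

# Sanity check on expected counts generated from known (made-up) parameters.
N <- 1000; delta_T <- 100; pTP <- 0.8; pFP <- 0.01
r <- delta_T * pTP^2 + (N - delta_T) * pFP^2
u <- 2 * delta_T * pTP * (1 - pTP) + 2 * (N - delta_T) * pFP * (1 - pFP)
est <- mom_given_ptp(r, u, N, pTP)

# Varying pTP traces out the one-dimensional family of solutions.
solutions <- t(sapply(c(0.6, 0.7, 0.8, 0.9),
                      function(p) mom_given_ptp(r, u, N, p)))
```

The closed form follows because the quadratic terms in ΔT cancel once pFP is eliminated; with noisy counts some values of pTP give estimates outside the valid range, which bounds the usable part of the solution family.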

[Figure panels: including systematically biased baits; without systematically biased baits]

[Figure: AP-MS example graph on Apl5, Apl6, Apm3, Aps3, Ckb1, Eno2; legend: untested edges marked "?", tested edges absent]

What is an example and do statistics really make a difference?

Bait-induced subgraph:

Degree = number of edges incident on a node

For node n we observe:

$IE_n$ = set of in-edges
$OE_n$ = set of out-edges

$ID_n = |IE_n|$ = in-degree
$OD_n = |OE_n|$ = out-degree

$R_n = |IE_n \cap OE_n|$ = 'reciprocated' degree

$U_n = |\{IE_n \cup OE_n\} \setminus \{IE_n \cap OE_n\}|$ = 'unreciprocated' degree

$IE_{Apm3}$ = {Apl5, Apl6}
$OE_{Apm3}$ = {Apl6}

$ID_{Apm3}$ = 2
$OD_{Apm3}$ = 1

$R_{Apm3}$ = 1
$U_{Apm3}$ = 1
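These set definitions translate directly into base R set operations. The sketch below uses the three bait observations from Gavin et al. (2002) listed on the slides; the helper `recip_degree` is hypothetical, and it assumes that edges are restricted to the baits themselves and that a bait detecting itself is ignored.

```r
# Bait -> detected proteins, from the Gavin et al. (2002) example on the slides.
found <- list(
  Apl5 = c("Apl5", "Apl6", "Apm3", "Aps3", "Ckb1"),
  Apl6 = c("Apl5", "Apl6", "Apm3", "Eno2"),
  Apm3 = c("Apl6", "Apm3")
)
baits <- names(found)

# Hypothetical helper: reciprocated / unreciprocated degree of one bait within
# the bait-induced subgraph (edges among baits only, self-detections ignored).
recip_degree <- function(n) {
  others <- setdiff(baits, n)
  OE <- intersect(found[[n]], others)                           # n detected them
  IE <- others[sapply(others, function(b) n %in% found[[b]])]   # they detected n
  c(ID = length(IE), OD = length(OE),
    R = length(intersect(IE, OE)),
    U = length(setdiff(union(IE, OE), intersect(IE, OE))))
}

degrees <- t(sapply(baits, recip_degree))
```

The row for Apm3 reproduces the values on the slide (in-degree 2, out-degree 1, one reciprocated and one unreciprocated edge).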

Estimating degree under stochastic error only

Current practice: $D_n = R_n + U_n$
$D_{Apm3} = 2$


Likelihood Approach

pTP = true positive probability
pFP = false positive probability

If pFP = 0, then for 'true' edges
$p_{T2}$ = p(observing reciprocated edge) = $p_{TP}^2$
$p_{T1}$ = p(observing unreciprocated edge) = $2 p_{TP}(1 - p_{TP})$
$p_{T0}$ = p(observing no edge) = $(1 - p_{TP})^2$

Given degree $\Delta_T$,

$$\Pr(R = r, U = u \mid \Delta_T, p_{TP}) = \binom{\Delta_T}{r,\; u,\; \Delta_T - r - u}\, p_{T2}^{\,r}\, p_{T1}^{\,u}\, p_{T0}^{\,\Delta_T - r - u}$$

Write a similar statement for FPs and then maximize the convolution to estimate degree.
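For the special case pFP = 0, maximizing the likelihood over the degree is a one-line search with base R's `dmultinom`. This is a minimal sketch under that simplification only; it does not implement the convolution with false positives. The helper `mle_degree`, the search range `max_extra`, and the pTP values are assumptions for illustration.

```r
# Hypothetical helper: maximum-likelihood degree when pFP = 0, given r
# reciprocated and u unreciprocated observed edges and a known pTP.
mle_degree <- function(r, u, pTP, max_extra = 50) {
  probs <- c(pTP^2, 2 * pTP * (1 - pTP), (1 - pTP)^2)   # pT2, pT1, pT0
  candidates <- (r + u):(r + u + max_extra)
  lik <- sapply(candidates, function(d) {
    dmultinom(c(r, u, d - r - u), prob = probs)
  })
  candidates[which.max(lik)]
}

# Apm3 on the slides has r = 1, u = 1; the pTP values are made up.
deg_high_sens <- mle_degree(1, 1, pTP = 0.7)
deg_low_sens  <- mle_degree(1, 1, pTP = 0.3)
```

With high sensitivity the estimate coincides with the current-practice value R + U, while lower sensitivity pushes the estimate above the observed count, since some true edges are likely to have been missed in both directions.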

Notes

Independence assumption allows product of multinomial probabilities
Only models stochastic error
Only one observation in the likelihood
Truncated binomial for data not subject to FPs was developed by Blumenthal and Dahiya (JASA 1981) and Olkin et al. (JASA 1981).

Summary

Three main applications
(1) Knowledge representation
(2) Exploratory data analysis
(3) Inference

Bioconductor provides a rich set of tools for (1) and (2)… need more of (3)!

Acknowledgements

Robert Gentleman
Wolfgang Huber (thanks for lots of slides)
Vince Carey
Jeff Gentry
Elizabeth Whalen
Seth Falcon
