Graphs in R and Bioconductor Statistical Analysis of Microarray Expression Data with R and Bioconductor Copenhagen DK, November 2007 Denise Scholtens, Ph.D. Assistant Professor, Department of Preventive Medicine Northwestern University Medical School, Chicago IL USA [email protected]


Page 1: Graphs in R and Bioconductor

Graphs in R and Bioconductor

Statistical Analysis of Microarray Expression Data with R and Bioconductor

Copenhagen DK, November 2007

Denise Scholtens, Ph.D.

Assistant Professor, Department of Preventive Medicine
Northwestern University Medical School, Chicago IL USA

[email protected]

Page 2: Graphs in R and Bioconductor

Graphs

• Sets of nodes and edges
  • Nodes: objects of interest
  • Edges: relationships between them

• A useful abstraction for talking about relationships, interactions, etc.

Page 3: Graphs in R and Bioconductor

Graphs

• Nodes
  • Node types
  • Conditions on interactions between nodes

• Edges
  • Edge types
  • Direction
  • Weights

Graphs are very flexible, but can quickly get complicated!

Page 4: Graphs in R and Bioconductor

Graphs

• Knowledge representation
  • Data structure, visualization
• Exploratory data analysis (EDA)
  • Graph traversal and analysis
• Inference
  • Adopting a statistical paradigm for drawing conclusions about data recorded in graphs

Page 5: Graphs in R and Bioconductor

Graphs: Inference

• Ex. Testing association between two graphs
• Ex. Identifying local features of large global graphs
• Statistical approaches
  • FP: edges that were tested and found, but do not exist in nature
  • FN: edges that were tested and not found, but do exist in nature
  • Untested: edges that were never tested, so we do not know whether they exist in nature

Page 6: Graphs in R and Bioconductor

Example: undirected graph

Page 7: Graphs in R and Bioconductor

Example: directed graph

Page 8: Graphs in R and Bioconductor

Elementary computations on IMCA pathway

> library("graph")
> data("integrinMediatedCellAdhesion")
> class(IMCAGraph)
> s = acc(IMCAGraph, "SOS")

  Ha-Ras      Raf      MEK
       1        2        3
     ERK     MYLK      MYO
       4        5        6
 F-actin cell proliferation
       7                  5

Page 9: Graphs in R and Bioconductor

Example: Directed Acyclic Graph (DAG)

• Gene Ontology (GO)
  • A structured vocabulary to describe the molecular function of gene products, biological processes, and cellular components.
  • A set of "is a", "is part of", and "has a" relationships between these terms

Page 10: Graphs in R and Bioconductor

GO graphs

Page 11: Graphs in R and Bioconductor

Example: Bipartite graph

• Two distinct sets of nodes, U and V
• Edges exist between elements of U and V
• Edges cannot connect nodes in U to other nodes in U, and similarly for V
• E.g. literature co-citation graphs

Page 12: Graphs in R and Bioconductor

Gene-Literature graphs

DKC1

Page 13: Graphs in R and Bioconductor

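An adjacency matrix $A_G$ ($n \times m$) is often used to represent a bipartite graph $G$ with node sets $U$, $V$ (rows indexed by $V$ and columns by $U$, so that the products below have the right dimensions).

One-mode graphs (Boolean algebra):

$A_U = A_G^{t} A_G$: co-citation of genes in literature

$A_V = A_G A_G^{t}$: literature containing common genes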

Bipartite graph transformation
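
The projection can be tried out in a few lines of base R. This is only a sketch: the paper and gene names are made up, and it assumes rows of the incidence matrix are papers (V) and columns are genes (U).

```r
# Hypothetical incidence matrix: rows = papers (V), columns = genes (U).
AG <- matrix(c(1, 1, 0,
               0, 1, 1,
               0, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("paper1", "paper2", "paper3"),
                             c("geneA", "geneB", "geneC")))

# Boolean matrix product: entry is 1 if at least one shared neighbour exists.
boolProd <- function(X, Y) (X %*% Y > 0) * 1

AU <- boolProd(t(AG), AG)   # genes x genes: co-cited in at least one paper
AV <- boolProd(AG, t(AG))   # papers x papers: share at least one gene
diag(AU) <- 0               # drop self-loops for readability
diag(AV) <- 0

AU
AV
```

Here geneA and geneB are linked in AU because paper1 cites both, while geneA and geneC share no paper and stay unconnected.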

Page 14: Graphs in R and Bioconductor

Data structure

• Flexible way to record and visualize data
  • Undirected/directed edges
  • Node colors
  • Structures (DAG, bipartite graph)
  • Multiple edges
  • Weighted or labeled edges

Note: the user is responsible for recognizing and making use of the graph structure!

Page 15: Graphs in R and Bioconductor

• Directed, undirected graphs
• Adjacent nodes
• Accessible nodes
• Self-loop
• Node degree
• Walk: alternating sequence of nodes and incident edges
• Closed walk
• Distance between nodes, shortest walk
• Trail: walk with no repeated edges
• Path: trail with no repeated nodes (except possibly first/last)
• Connected graph
• Weakly connected directed graph
• Strongly connected directed graph

Graphs: vocabulary

Page 16: Graphs in R and Bioconductor

• graph: basic class definitions and functionality

• Rgraphviz: rendering functionality
  • Different layout algorithms
  • Node plotting, line type, color, etc. can be controlled by the user

• RBGL: interface to graph algorithms

graph, Rgraphviz, RBGL

Page 17: Graphs in R and Bioconductor

graph package: classes

• graph: a general class that all other classes should extend

• graphNEL: node/edge-list representation
  • Can specify direction of edges, edge weights, etc.

• distGraph: graph based on distances between nodes

• clusterGraph: series of completely connected subgraphs (cliques) with no edges between them

Page 18: Graphs in R and Bioconductor

> library("graph"); library(Rgraphviz)

> myNodes = c("s", "p", "q", "r")

> myEdges = list(s = list(edges = c("p", "q")),
+                p = list(edges = c("p", "q")),
+                q = list(edges = c("p", "r")),
+                r = list(edges = c("s")))

> g = new("graphNEL", nodes = myNodes,
+         edgeL = myEdges, edgemode = "directed")

> plot(g)

Creating a graph

Page 19: Graphs in R and Bioconductor

> nodes(g)
[1] "s" "p" "q" "r"

> edges(g)
$s
[1] "p" "q"
$p
[1] "p" "q"
$q
[1] "p" "r"
$r
[1] "s"

> degree(g)
$inDegree
s p q r
1 3 2 1
$outDegree
s p q r
2 2 2 1

Querying nodes, edges, degree

Page 20: Graphs in R and Bioconductor

> ## here g is the graph above with its nodes relabelled "a", "b", "c", "d"

> g1 <- addNode("e", g)

> g2 <- removeNode("d", g)

> ## addEdge(from, to, graph, weights)

> g3 <- addEdge("e", "a", g1, pi/2)

> ## removeEdge(from, to, graph)

> g4 <- removeEdge("e", "a", g3)

> identical(g4, g1)

[1] TRUE

Graph manipulation

Page 21: Graphs in R and Bioconductor

> adj(g, c("b", "c"))
$b
[1] "b" "c"
$c
[1] "b" "d"

> acc(g, c("b", "c"))
$b
a c d
3 1 2

$c
a b d
2 1 1

Adjacent and accessible nodes
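
What acc() reports can be reproduced with a plain breadth-first search. The sketch below uses base R only and is not the graph package's implementation; the function name accessible and the edge list edgeL are made up here, with the edges chosen to match the adj() and acc() output shown above.

```r
# Edge list of the example graph (node -> vector of direct successors).
edgeL <- list(a = c("b", "c"), b = c("b", "c"), c = c("b", "d"), d = "a")

# Breadth-first search: distance (in edges) to every node reachable from 'start'.
accessible <- function(edgeL, start) {
  dist <- numeric(0)
  visited <- start
  frontier <- start
  step <- 0
  while (length(frontier) > 0) {
    step <- step + 1
    # successors of the current frontier that have not been seen yet
    nxt <- setdiff(unlist(edgeL[frontier], use.names = FALSE), visited)
    if (length(nxt) > 0) dist[nxt] <- step
    visited <- c(visited, nxt)
    frontier <- nxt
  }
  dist[order(names(dist))]
}

accessible(edgeL, "b")
accessible(edgeL, "c")
```

Adjacent nodes are simply those at distance 1; accessible nodes are everything the search ever reaches.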

Page 22: Graphs in R and Bioconductor
Page 23: Graphs in R and Bioconductor

• Node-edge lists
• Adjacency matrix (straightforward)
• Adjacency matrix (sparse)
• From-To matrix

They are equivalent, but may differ hugely in performance and convenience for different applications.

The representations can be coerced into one another.

Graph representation

Page 24: Graphs in R and Bioconductor

> ft
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4

> ftM2adjM(ft)
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1

> ftM2graphNEL(ft)
A graphNEL graph with directed edges
Number of Nodes = 4
Number of Edges = 4

Graph representations: from-to matrix
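
To see why the representations are equivalent, the from-to matrix can be turned into an adjacency matrix by hand. This base-R sketch does not call the graph package; the variable names adjM and ftBack are assumptions of the sketch.

```r
# From-to matrix: one row per directed edge (from, to).
ft <- matrix(c(1, 2,
               2, 3,
               3, 1,
               4, 4), ncol = 2, byrow = TRUE)

n <- max(ft)
adjM <- matrix(0, n, n, dimnames = list(1:n, 1:n))
adjM[ft] <- 1          # a two-column matrix indexes (row, column) pairs
adjM

# And back again: every non-zero cell is one edge.
ftBack <- which(adjM == 1, arr.ind = TRUE)
```

The dense matrix needs n² cells however few edges there are, which is the performance trade-off mentioned above.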

Page 25: Graphs in R and Bioconductor

Connected components

cc = connComp(rg)
table(listLen(cc))

 1  2  3  4 15 18
36  7  3  2  1  1

Choose the largest component

wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)

Depth first search

dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"

[Figure: rg]

RBGL: interface to Boost Graph Library

Page 26: Graphs in R and Bioconductor

dfs(sg, "N14")
bfs(sg, "N14")

depth/breadth first search

Page 27: Graphs in R and Bioconductor

Connected components

sc = strongComp(g2)

nattrs = makeNodeAttrs(g2, fillcolor = "")

for(i in 1:length(sc))
  nattrs$fillcolor[sc[[i]]] = myColors[i]

plot(g2, "dot", nodeAttrs = nattrs)

wc = connComp(g2)

Page 28: Graphs in R and Bioconductor

Different algorithms for different types of graphs
  o all edge weights the same
  o positive edge weights
  o real numbers

…and different settings of the problem
  o single pair
  o single source
  o single destination
  o all pairs

Functions
  bfs
  dijkstra.sp
  sp.between
  johnson.all.pairs.sp

Shortest path algorithms

Page 29: Graphs in R and Bioconductor

set.seed(123)
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2, fromNode, toNode)

sp[[1]]$path
 [1] "N43" "N08" "N88"
 [4] "N73" "N50" "N89"
 [7] "N64" "N93" "N32"
[10] "N12" "N81"

sp[[1]]$length
[1] 10

Shortest path

Page 30: Graphs in R and Bioconductor

ap = johnson.all.pairs.sp(rg2)
hist(ap)

Shortest path
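
For intuition about what an all-pairs distance matrix contains, here is a base-R sketch on a tiny made-up weighted graph. It uses the Floyd-Warshall recursion, not Johnson's algorithm as in RBGL; both solve the all-pairs problem, and Floyd-Warshall is simply shorter to write down.

```r
# Weighted adjacency matrix of a small hypothetical directed graph; Inf = no edge.
nodes <- c("a", "b", "c", "d")
W <- matrix(Inf, 4, 4, dimnames = list(nodes, nodes))
diag(W) <- 0
W["a", "b"] <- 1
W["b", "c"] <- 2
W["a", "c"] <- 5
W["c", "d"] <- 1

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
D <- W
for (k in nodes)
  for (i in nodes)
    for (j in nodes)
      D[i, j] <- min(D[i, j], D[i, k] + D[k, j])

D
table(D[is.finite(D) & D > 0])   # distribution of finite path lengths
```

The direct edge a -> c has weight 5, but the detour through b costs only 3, so D holds the smaller value. Unreachable pairs stay at Inf.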

Page 31: Graphs in R and Bioconductor

mst = mstree.kruskal(gr)

[Figure: gr]

Minimal spanning tree

Page 32: Graphs in R and Bioconductor

Consider a graph g with a single connected component.

Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.

Minimum disconnecting set: the set of edges in this cut.

> edgeConnectivity(g)
$connectivity
[1] 2

$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"

$minDisconSet[[2]]
[1] "D" "H"

Connectivity

Page 33: Graphs in R and Bioconductor
Page 34: Graphs in R and Bioconductor

dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.

neato: undirected graphs using 'spring' models

twopi: radial layout. One node ('root') is chosen as the center. Remaining nodes are placed on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. The root can be specified or chosen heuristically.

Rgraphviz: layout engines

Page 35: Graphs in R and Bioconductor

Rgraphviz: layout engines

Page 36: Graphs in R and Bioconductor

Rgraphviz: layout engines

Page 37: Graphs in R and Bioconductor

Combining R graphics and Rgraphviz: custom node drawing functions

Page 38: Graphs in R and Bioconductor

Inference questions

• Compare two graphs: GraphAT

• Identify local features in global graphs: apComplex

• Large scale topological features of graphs: RBGL, measurement error effects

• Estimating error probabilities: ppiStats

Page 39: Graphs in R and Bioconductor

Compare two graphs

GraphAT

Page 40: Graphs in R and Bioconductor

Compare two graphs

Observed graph of the literature protein-protein interactions used in Ge et al. (315 edges, 298 nodes)

Page 41: Graphs in R and Bioconductor

Cluster graph for the 30 clusters reported in Ge et al. (156,205 edges, 2885 nodes)

(All genes that are not in the list of literature-reported PPIs have been removed from this graph for visualization purposes.)

Page 42: Graphs in R and Bioconductor

Comparing two graphs

• Nodes: yeast genes
• Graph 1: literature-reported protein-protein interactions
• Graph 2: cell cycle gene expression cluster membership
• Do the graphs overlap more than expected at random?
• Is there anything special about the overlapping edges?

Page 43: Graphs in R and Bioconductor

Graph of the reported intracluster edges in Ge et al. (42 edges, 65 nodes)

This graph was derived by intersecting the observed literature graph with the cluster graph.

Page 44: Graphs in R and Bioconductor

Graph resulting from random reassignment of 315 edges among 2885 nodes

Note that the structure of this graph is quite different from the observed literature graph.

Page 45: Graphs in R and Bioconductor

Intersection of random edge graph with cluster graph

Page 46: Graphs in R and Bioconductor

Intersecting Edges

Random Edge algorithm (RE)

Permuting Node Labels algorithm (PN)
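
The two reference distributions can be sketched in base R. This is an illustration on made-up data, not the GraphAT implementation: the graph sizes, the cluster assignment and the helper names randomGraph and nIntersect are all assumptions of the sketch.

```r
set.seed(1)
n <- 30                                  # hypothetical number of genes

# Symmetric 0/1 adjacency matrix with m random undirected edges.
randomGraph <- function(n, m) {
  A <- matrix(0, n, n)
  up <- which(upper.tri(A))
  A[sample(up, m)] <- 1
  A + t(A)
}

# Cluster graph: genes in the same (hypothetical) cluster are all connected.
cl <- rep(1:3, each = 10)
clusterG <- outer(cl, cl, "==") * 1
diag(clusterG) <- 0
litG <- randomGraph(n, 40)               # stand-in for the literature graph

# Test statistic: number of edges present in both graphs.
nIntersect <- function(A, B) sum(A * B) / 2
obs <- nIntersect(litG, clusterG)

# PN: permute the node labels of one graph, keeping its structure fixed.
pn <- replicate(1000, { p <- sample(n); nIntersect(litG[p, p], clusterG) })
# RE: redistribute the same number of edges at random.
re <- replicate(1000, nIntersect(randomGraph(n, 40), clusterG))

pPN <- mean(pn >= obs)                   # one-sided permutation p-values
pRE <- mean(re >= obs)
```

PN keeps the degree structure of the permuted graph intact, whereas RE destroys it, so the two null distributions need not agree.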

Page 47: Graphs in R and Bioconductor

Graph with the same test statistic as the observed graph of intracluster edges reported in Ge et al.: 42 intracluster edges

Page 48: Graphs in R and Bioconductor

Other Test Statistics

Page 49: Graphs in R and Bioconductor

Questions of Interest

• Is it reasonable to condition on the structure of the observed graphs, something like an ancillary/sufficient statistic?

• Why is the number of intersecting edges invariant to the node label permutation and random edge reassignment algorithms?

• What are the most informative test statistics?

Page 50: Graphs in R and Bioconductor

EDA Questions of Interest

• Which expression clusters have intersections with which of the literature clusters?
• Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?
• Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine, or that literature clusters are not cell-cycle regulated?
• Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?

Page 51: Graphs in R and Bioconductor

Identify local features in global graphs

apComplex

Page 52: Graphs in R and Bioconductor

Local Modeling of Global Interactome Data

AP-MS (Affinity Purification - Mass Spectrometry)

Measures Complex Comembership

Gavin, et al. (Nature, 2002 and Nature, 2006)
Ho, et al. (Nature, 2002)
Krogan, et al. (Mol Cell 2004, Nature 2006)

Y2H (Yeast Two Hybrid)

Measures Physical Interactions

Ito, et al. (PNAS, 1998)
Uetz, et al. (Nature, 2000)

Page 53: Graphs in R and Bioconductor

AP-MS data:

Using a bait protein, AP-MS technology finds prey proteins that are comembers of at least one complex with the bait.

Y2H data:

Y2H technology finds pairs of physically interacting proteins.

(one purification)

bait

prey

Page 54: Graphs in R and Bioconductor

AP-MS data: Y2H data:

We want to estimate the bipartite protein complex membership graph, A:

*Estimation of A requires estimation of K, the number of complexes.

Page 55: Graphs in R and Bioconductor

1. Some proteins participate in more than one complex

2. In an AP-MS experiment, some proteins are used as baits and some proteins are only ever found as prey

3. Graph theoretic paradigm to allow for succinct formulation
  • Bipartite graph for complex membership (A)
  • Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
  • AP-MS and Y2H are different technologies that measure different relationships between proteins

4. Statistical paradigm to allow for false positive and false negative observations

Four unique aspects of the algorithm

Page 56: Graphs in R and Bioconductor

PP2A

Heterotrimeric complex consisting of:

Tpd3- regulatory A subunit

Rts1 or Cdc55- regulatory B subunits

Pph21 or Pph22- catalytic subunits

Jiang and Broach (1999). EMBO.

1. Some proteins participate in more than one complex

Gavin, et al. (2002): Rgraphviz plot of yTAP C151

Bader & Hogue (2002): Portion of Figure 2: Overlap of the spoke models of TAP and HMS-PCI.

Jansen, et al. (2003): PIT Bayesian Network, LR>600

http://genecensus.org/intint

Tpd3

Pph21

Myo5

Cdc55

Cdc11

Pph22

Cdc10

Page 57: Graphs in R and Bioconductor

1. Some proteins participate in more than one complex

PP2A

Heterotrimeric complex consisting of:

Tpd3- regulatory A subunit

Rts1 or Cdc55- regulatory B subunits

Pph21 or Pph22- catalytic subunits

Jiang and Broach (1999). EMBO.

The apComplex algorithm detects:

Zds1 and Zds2 (known cell-cycle regulators)only exist in complexes with the Cdc55-Pph22 trimer!

Page 58: Graphs in R and Bioconductor

2. Graph theoretic paradigm to allow for succinct expression of constructs involved

• Bipartite graph for complex membership
• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
• AP-MS and Y2H are different technologies that measure different relationships between proteins

Page 59: Graphs in R and Bioconductor

2. Graph theoretic paradigm to allow for succinct expression of constructs involved

• Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)

Page 60: Graphs in R and Bioconductor

[Diagram: AP-MS data $Z$ for bait and hit-only proteins → estimation algorithm → estimates $(\hat{A}, \hat{Y})$, with $Y = A \otimes A'$]

In summary…

We start with an initial estimate for A, and then refine thatestimate according to a two component probability measure:

$$P(Z \mid A, \mu, \alpha) = L(Z \mid Y = A \otimes A', \mu, \alpha) \times C(Z \mid A, \mu, \alpha)$$

where $L$ is the usual likelihood and $C$ is a regularization/penalty term (no. of complexes).

Page 61: Graphs in R and Bioconductor

Large scale topological features of graphs

RBGL, measurement error

Page 62: Graphs in R and Bioconductor

Apl6

Apm3 Apl5

untested: ?

tested: absent

Apl6

Apm3 Apl5

Eno2

Aps3

Ckb1

6 AP-MS observations, Gavin et al. (2002)

Apl5: Apl5, Apl6, Apm3, Aps3, Ckb1
Apl6: Apl5, Apl6, Apm3, Eno2
Apm3: Apl6, Apm3

Measurement Error
• FPs
• FNs
• Stochastic
• Systematic

Missing Data
• Untested Edges

Suitable Inference
• What can we conclude?

Page 63: Graphs in R and Bioconductor

Statistical Modeling

• Network data are experimentally obtained and hence should be subjected to the same types of data analysis as other data

• Given some model, what can we conclude?
  • Likelihood methods
  • Global and local feature estimation

Page 64: Graphs in R and Bioconductor

Measurement Error

• Stochastic
  • FPs and FNs made 'randomly'

• Systematic
  • FPs and FNs made in some predictable way
  • 'Sticky' proteins may cause FPs
  • Conformationally deformed proteins may cause FNs
  • Errant treatment of untested edges as absent will cause systematic FNs

Page 65: Graphs in R and Bioconductor

Random graphs vs regular networks and what's in between?

• Small-world connectivity: Watts & Strogatz (1998)
• 6 degrees of separation
• $L_{\text{small-world}} \approx L_{\text{random}}$
  • L = average path length
• $C_{\text{small-world}} \gg C_{\text{random}}$
  • C = average clustering coefficient of nodes
  • If node n has $k_n$ neighbors, then

$$C_n = \frac{\#\ \text{observed edges between neighbors}}{k_n (k_n - 1)/2}$$
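
The clustering coefficient of a single node is easy to compute from an adjacency matrix. The base-R sketch below uses a made-up four-node undirected graph; the function name clusterCoef is an assumption, not an RBGL call.

```r
# Hypothetical undirected graph as a symmetric 0/1 adjacency matrix.
nodes <- c("a", "b", "c", "d")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edges <- rbind(c("a", "b"), c("a", "c"), c("a", "d"), c("b", "c"))
A[edges] <- 1              # character matrix indexes by (row name, column name)
A[edges[, 2:1]] <- 1       # add the reverse direction to make A symmetric

clusterCoef <- function(A, n) {
  nb <- names(which(A[n, ] == 1))      # neighbours of n
  k <- length(nb)
  if (k < 2) return(NA)                # undefined for fewer than two neighbours
  observed <- sum(A[nb, nb]) / 2       # edges among the neighbours
  observed / (k * (k - 1) / 2)         # divided by the number of possible edges
}

sapply(nodes, function(n) clusterCoef(A, n))
```

Node a has three neighbours (b, c, d) and only the pair b-c is linked, so one of three possible neighbour pairs is observed.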

Page 66: Graphs in R and Bioconductor

Small-world

Watts & Strogatz (Nature 1998)

Page 67: Graphs in R and Bioconductor

Scale-free

• Class of small-world networks
• Node degree distribution follows a power law
• In scale-free graphs there are a few highly connected "hubs"
• Biology: relative robustness of the network to random perturbation, but huge breakdown under targeted disruption

Page 68: Graphs in R and Bioconductor

Simulation Study

• Random, Scale-Free, Overlapping Cluster Graphs
• 50, 500, and 1000 nodes
• Approx. 5 edges/node
• Stochastic FNs: 0.05, 0.15, 0.25
• Stochastic FPs: PPV=0.50
• Systematic FPs: 'sticky' baits detect neighbors of neighbors with p=0.50

Page 69: Graphs in R and Bioconductor

L, stochastic FNs

Page 70: Graphs in R and Bioconductor

L, stochastic FNs

• In general, FNs increase L
• In the overlapping cluster graph, FNs splinter the graph into unconnected components
• How to treat L for an unconnected graph?
• Misleading results if treated naïvely

Page 71: Graphs in R and Bioconductor

C, stochastic FPs (probability an observed edge is true = 0.5)

Page 72: Graphs in R and Bioconductor

C, stochastic FPs

• If a node has k neighbors, C is the fraction of possible edges between those neighbors that exist

• Both numerator and denominator are affected

• For cluster graphs, nodes have more neighbors due to FPs, but the number of edges between the neighbors does not increase proportionately

Page 73: Graphs in R and Bioconductor

Node degree distribution, systematic FPs

• Looked at log-log plot of the complementary cumulative distribution function (not the frequency distribution)
  • Based on theoretical work by Li, et al. (2006), Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics.

• Assess fit of straight line using R²

Page 74: Graphs in R and Bioconductor

Node degree distribution, systematic FPs

Page 75: Graphs in R and Bioconductor

Take Home Message from Simulation Study

• FPs and FNs do affect statistics on graphs
• Even with small amounts of measurement error, the effects can lead to biological misinterpretations
  • Specifically, scale-free

Page 76: Graphs in R and Bioconductor

Estimating error probabilities

ppiData

Page 77: Graphs in R and Bioconductor

Systematic Error: Bias in Bait-Prey Systems

Apl6

Apm3 Apl5

Apl6

Apm3 Apl5

Doubly tested bait-bait edges may be
• reciprocated
  • tested twice, observed twice
• unreciprocated
  • tested twice, observed once

For a bait subject only to stochastic error, we expect the set of unreciprocated edges to consist of an approximately equal number of in- and out-edges.

If this is not the case, the bait is subject to systematic bias.

Page 78: Graphs in R and Bioconductor

In-degree vs. out-degree: unreciprocated edges in bait-induced subgraphs (square root scale)

Gavin, 2006 (AP-MS) Krogan, 2006 (AP-MS)

Page 79: Graphs in R and Bioconductor

Per protein 'coin-tossing' model

Quantify departure from symmetry using a binomial distribution with probability parameter p=0.50.

Previous pictures show nodes with p-value < 0.01 in dark blue.

Removing these nodes has implications for overall estimates of stochastic FP and FN probabilities.
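
The coin-tossing test maps directly onto binom.test() in base R. The counts below are invented for illustration; only the p=0.50 null and the 0.01 cut-off come from the slides.

```r
# Hypothetical counts of unreciprocated in- and out-edges for three baits.
unrecip <- data.frame(bait   = c("bait1", "bait2", "bait3"),
                      inDeg  = c(4, 1, 12),
                      outDeg = c(5, 15, 10))

# Two-sided binomial test of symmetry: under stochastic error alone, each
# unreciprocated edge is an in-edge with probability 0.5.
unrecip$pValue <- mapply(function(i, o) binom.test(i, i + o, p = 0.5)$p.value,
                         unrecip$inDeg, unrecip$outDeg)

# Flag baits whose asymmetry is unlikely under the coin-tossing model.
unrecip$biased <- unrecip$pValue < 0.01
unrecip
```

Only bait2, with 1 in-edge against 15 out-edges, is flagged here; the other two baits are compatible with symmetric stochastic error.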

Page 80: Graphs in R and Bioconductor

Global estimation of $p_{TP}$ and $p_{FP}$

Let $R_T$ and $U_T$ be random variables for the true positive observations. Then, given the true number of edges in the graph, $\Delta_T$,

$$\Pr(R_T = r_T, U_T = u_T \mid \Delta_T, p_{TP}) = \binom{\Delta_T}{r_T,\ u_T}\, p_{TP}^{2 r_T}\, \bigl(2 p_{TP}(1 - p_{TP})\bigr)^{u_T}\, (1 - p_{TP})^{2(\Delta_T - r_T - u_T)}.$$

Similarly, let $R_F$ and $U_F$ be random variables for the false positive observations. Then, for the total number of potential edges in the graph, $N$,

$$\Pr(R_F = r_F, U_F = u_F \mid N - \Delta_T, p_{FP}) = \binom{N - \Delta_T}{r_F,\ u_F}\, p_{FP}^{2 r_F}\, \bigl(2 p_{FP}(1 - p_{FP})\bigr)^{u_F}\, (1 - p_{FP})^{2((N - \Delta_T) - r_F - u_F)}.$$

We observe $R = R_T + R_F$ and $U = U_T + U_F$.

So we deal with the convolution.

Page 81: Graphs in R and Bioconductor

Estimating pTP and pFP

• Using the expectations of R and U, we can derive two independent equations with three unknown parameters (pTP, pFP, and ΔT).

• For the resulting method of moments estimators, any one of the parameters defines the other two, so we can easily derive a one-dimensional solution manifold (with variance bounds).

• In practice for AP-MS data, we choose to estimate pTP using a gold standard set of complex co-membership relationships that exist under experimental conditions similar to those used for AP-MS studies.

Page 82: Graphs in R and Bioconductor

Including systematically biased baits Without systematically biased baits

Page 83: Graphs in R and Bioconductor

Apl6

Apm3 Apl5

untested: ?

tested: absent

Apl6

Apm3 Apl5

Eno2

Aps3

Ckb1

What is an example and do statistics really make a difference?

Bait-induced subgraph:

Degree = number of edges incident on a node

For node n we observe:

IEn = set of in-edges
OEn = set of out-edges

IDn = |IEn| = in-degree
ODn = |OEn| = out-degree

Rn = |IEn ∩ OEn| = 'reciprocated' degree

Un = |(IEn ∪ OEn) \ (IEn ∩ OEn)| = 'unreciprocated' degree

IEApm3 = {Apl5, Apl6}
OEApm3 = {Apl6}

IDApm3 = 2
ODApm3 = 1

RApm3 = 1
UApm3 = 1

Estimating degree under stochastic error only
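
The Apm3 numbers can be checked mechanically from the three bait observations of Gavin et al. (2002) listed earlier in the deck. The base-R sketch below treats each bait-to-prey hit as a directed edge and ignores self-hits; the function name degreeStats is an assumption of the sketch.

```r
# AP-MS observations: each bait and the prey it pulled down.
apms <- list(Apl5 = c("Apl5", "Apl6", "Apm3", "Aps3", "Ckb1"),
             Apl6 = c("Apl5", "Apl6", "Apm3", "Eno2"),
             Apm3 = c("Apl6", "Apm3"))
baits <- names(apms)

# In/out/reciprocated/unreciprocated degree of bait n in the bait-induced subgraph.
degreeStats <- function(n) {
  OE <- setdiff(intersect(apms[[n]], baits), n)                    # out-edges
  IE <- setdiff(baits[sapply(apms, function(prey) n %in% prey)], n) # in-edges
  c(ID = length(IE), OD = length(OE),
    R = length(intersect(IE, OE)),
    U = length(setdiff(union(IE, OE), intersect(IE, OE))))
}

degreeStats("Apm3")
```

For Apm3 this reproduces ID = 2, OD = 1, R = 1 and U = 1.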

Page 84: Graphs in R and Bioconductor

Current practice: Dn = Rn + Un

DApm3 = 2

Apl6

Apm3 Apl5

Apl6

Apm3 Apl5

Eno2

Aps3

Ckb1

What is an example and do statistics really make a difference?

Page 85: Graphs in R and Bioconductor

Likelihood Approach

pTP = true positive probability
pFP = false positive probability

If pFP = 0, then for 'true' edges

$p_{T2}$ = p(observing reciprocated edge) = $p_{TP}^2$
$p_{T1}$ = p(observing unreciprocated edge) = $2 p_{TP}(1 - p_{TP})$
$p_{T0}$ = p(observing no edge) = $(1 - p_{TP})^2$

Given degree $\delta$:

$$\Pr(R = r, U = u \mid \delta, p_{TP}) = \binom{\delta}{r,\ u}\, p_{T2}^{\,r}\, p_{T1}^{\,u}\, p_{T0}^{\,\delta - r - u}$$

Write a similar statement for FPs and then maximize the convolution to estimate degree.
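
For the special case pFP = 0 the likelihood is a single multinomial term and can be profiled over candidate degrees in base R. This sketch covers only that special case, not the full convolution with false positives; the name delta for the true degree and the pTP values are assumptions chosen for illustration.

```r
# Log-likelihood of true degree 'delta' given r reciprocated and u unreciprocated
# observed edges, assuming pFP = 0 and independent edge observations.
logLikDegree <- function(delta, r, u, pTP) {
  probs <- c(pTP^2, 2 * pTP * (1 - pTP), (1 - pTP)^2)   # pT2, pT1, pT0
  dmultinom(c(r, u, delta - r - u), prob = probs, log = TRUE)
}

# Maximize over candidate degrees; the degree cannot be below r + u when pFP = 0.
mleDegree <- function(r, u, pTP, maxDelta = 50) {
  cand <- (r + u):maxDelta
  ll <- sapply(cand, logLikDegree, r = r, u = u, pTP = pTP)
  cand[which.max(ll)]
}

# r = 1, u = 1 as observed for Apm3; both pTP values are hypothetical.
mleDegree(1, 1, pTP = 0.6)
mleDegree(1, 1, pTP = 0.3)
```

With a low true positive probability the estimate moves above the naive R + U count, because some true edges are expected to have been missed in both directions.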

Page 86: Graphs in R and Bioconductor

Notes

• Independence assumption allows product of multinomial probabilities
• Only models stochastic error
• Only one observation in the likelihood
• Truncated binomial for data not subject to FPs was developed by Blumenthal and Dahiya (JASA 1981) and Olkin et al. (JASA 1981).

Page 87: Graphs in R and Bioconductor

Summary

Three main applications
(1) Knowledge representation
(2) Exploratory data analysis
(3) Inference

Bioconductor provides a rich set of tools for (1) and (2)… need more of (3)!

Page 88: Graphs in R and Bioconductor

Acknowledgements

• Robert Gentleman
• Wolfgang Huber (thanks for lots of slides)
• Vince Carey
• Jeff Gentry
• Elizabeth Whalen
• Seth Falcon