Graphs in R and Bioconductor
Statistical Analysis of Microarray Expression Data with R and Bioconductor
Copenhagen DK, November 2007
Denise Scholtens, Ph.D.
Assistant Professor, Department of Preventive Medicine
Northwestern University Medical School, Chicago IL USA
Graphs
Sets of nodes and edges
  Nodes: objects of interest
  Edges: relationships between them
A useful abstraction to talk about relationships, interactions, etc.
Graphs
Nodes
  Node types
  Conditions on interactions between nodes
Edges
  Edge types
  Direction
  Weights
Graphs are very flexible, but can quickly get complicated!
Graphs
Knowledge representation
  Data structure, visualization
Exploratory data analysis (EDA)
  Graph traversal and analysis
Inference
  Adopting statistical paradigm for making conclusions about data recorded in graphs
Graphs
Inference
  Ex. Testing association between two graphs
  Ex. Identifying local features of large global graphs
Statistical approaches
  FP – edges that were tested, were found, but are not there in nature
  FN – edges that were tested, were not found, but are there in nature
  Untested – edges that were never tested, may or may not exist in nature, but we don’t know
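The edge categories above can be made concrete with a small base R sketch. Everything in it is made up for illustration: the three 0/1 matrices `truth`, `tested`, and `observed` and the helper `classify_edge` are hypothetical and not part of any Bioconductor package.

```r
# Hypothetical 3-node example; all three matrices are invented for illustration.
nodes <- c("A", "B", "C")
truth    <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3, dimnames = list(nodes, nodes))
tested   <- matrix(c(0,1,1, 1,0,0, 1,0,0), 3, 3, dimnames = list(nodes, nodes))
observed <- matrix(c(0,0,1, 0,0,0, 1,0,0), 3, 3, dimnames = list(nodes, nodes))

# Label every node pair as TP, FP, FN, TN, or untested.
classify_edge <- function(truth, tested, observed) {
  status <- matrix("untested", nrow(truth), ncol(truth),
                   dimnames = dimnames(truth))
  status[tested == 1 & observed == 1 & truth == 1] <- "TP"
  status[tested == 1 & observed == 1 & truth == 0] <- "FP"
  status[tested == 1 & observed == 0 & truth == 1] <- "FN"
  status[tested == 1 & observed == 0 & truth == 0] <- "TN"
  diag(status) <- NA   # self-pairs are not considered here
  status
}

status <- classify_edge(truth, tested, observed)
status
```

In practice `truth` is unknown, which is why the untested category has to be kept separate from "tested and absent".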
Example: undirected graph
Example: directed graph
Elementary computations on IMCA pathway
> library("graph")
> data("integrinMediatedCellAdhesion")
> class(IMCAGraph)
> s = acc(IMCAGraph, "SOS")
Ha-Ras  Raf  MEK  ERK  MYLK  MYO  F-actin  cell proliferation
     1    2    3    4     5    6        7                   5
Example: Directed Acyclic Graph (DAG)
Gene Ontology (GO)
  A structured vocabulary to describe molecular function of gene products, biological processes, and cellular components.
  A set of "is a", "is part of", and "has a" relationships between these terms
GO graphs
Example: Bipartite graph
Two distinct sets of nodes U and V
Edges exist between elements of U and V
Edges cannot connect nodes in U to other nodes in U, and similarly for V
E.g. literature co-citation graphs
Gene-Literature graphs
DKC1
An adjacency matrix AG (n x m) is often used to represent a bipartite graph G with node sets U, V
One mode graphs
$A_U = A_G^{t} A_G$ – co-citation of genes in literature
$A_V = A_G A_G^{t}$ – literature containing common genes
(Boolean algebra)
Bipartite graph transformation
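A minimal base R sketch of this transformation follows. The incidence matrix `AG` is invented, and its orientation (rows are papers, columns are genes) is an assumption made here so that `t(AG) %*% AG` is gene-by-gene; `bool_prod` is a hypothetical helper that applies the Boolean reading of the product.

```r
# Hypothetical incidence matrix: rows = papers, columns = genes (made-up data).
AG <- matrix(c(1, 1, 0,
               0, 1, 1), nrow = 2, byrow = TRUE,
             dimnames = list(c("paper1", "paper2"),
                             c("geneA", "geneB", "geneC")))

# Boolean product: an entry is 1 when the ordinary matrix product is positive.
bool_prod <- function(X, Y) (X %*% Y > 0) * 1

gene_graph  <- bool_prod(t(AG), AG)   # genes cited together in some paper
paper_graph <- bool_prod(AG, t(AG))   # papers that share some gene

gene_graph
paper_graph
```

Dropping the Boolean step and keeping the raw product gives edge weights instead (the number of shared papers or genes).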
Data structure
Flexible way to record and visualize data
  Undirected/directed edges
  Node colors
  Structures (DAG, bipartite graph)
  Multiple edges
  Weighted or labeled edges
Note: user has responsibility to recognize and make use of the graph structure!
Directed, undirected graphs
Adjacent nodes
Accessible nodes
Self-loop
Node degree
Walk: alternating sequence of nodes and incident edges
Closed walk
Distance between nodes, shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Connected graph
Weakly connected directed graph
Strongly connected directed graph
Graphs: vocabulary
graph: basic class definitions and functionality
Rgraphviz: rendering functionality
  Different layout algorithms.
  Node plotting, line type, color etc. can be controlled by the user.
RBGL: interface to graph algorithms
graph, Rgraphviz, RBGL
graph package Classes
graph: a general class that all other classes should extend
graphNEL: node/edge-list representation
  Can specify direction of edges, edge weights, etc.
distGraph: graph based on distances between nodes
clusterGraph: series of completely connected subgraphs (cliques) with no edges between them
> library("graph"); library(Rgraphviz)
> myNodes = c("s", "p", "q", "r")
> myEdges = list(s = list(edges = c("p", "q")),
+                p = list(edges = c("p", "q")),
+                q = list(edges = c("p", "r")),
+                r = list(edges = c("s")))
> g = new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")
> plot(g)
Creating a graph
> nodes(g)
[1] "s" "p" "q" "r"
> edges(g)
$s
[1] "p" "q"
$p
[1] "p" "q"
$q
[1] "p" "r"
$r
[1] "s"
> degree(g)
$inDegree
s p q r
1 3 2 1
$outDegree
s p q r
2 2 2 1
Querying nodes, edges, degree
> g1 <- addNode("e", g)
> g2 <- removeNode("r", g)
> ## addEdge(from, to, graph, weights)
> g3 <- addEdge("e", "s", g1, pi/2)
> ## removeEdge(from, to, graph)
> g4 <- removeEdge("e", "s", g3)
> identical(g4, g1)
[1] TRUE
Graph manipulation
> adj(g, c("p", "q"))
$p
[1] "p" "q"
$q
[1] "p" "r"
> acc(g, c("p", "q"))
$p
s q r
3 1 2
$q
s p r
2 1 1
Adjacent and accessible nodes
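To see what the accessibility output means, the step counts can be reproduced with a plain breadth-first search. This is a base R sketch on the edge list from the "Creating a graph" example; `bfs_dist` is a hypothetical helper written for this note, not a function of the graph package.

```r
# Edge list copied from the "Creating a graph" example (nodes s, p, q, r).
edge_list <- list(s = c("p", "q"), p = c("p", "q"), q = c("p", "r"), r = "s")

# Breadth-first search: number of steps from `start` to every reachable node.
bfs_dist <- function(edge_list, start) {
  dist <- setNames(rep(NA_integer_, length(edge_list)), names(edge_list))
  dist[start] <- 0L
  queue <- start
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (w in edge_list[[v]]) {
      if (is.na(dist[w])) {          # first visit gives the shortest distance
        dist[w] <- dist[v] + 1L
        queue <- c(queue, w)
      }
    }
  }
  dist[!is.na(dist) & names(dist) != start]   # drop the start node itself
}

d <- bfs_dist(edge_list, "p")
d
```

Starting from p, the search reaches q in one step, r in two, and s in three.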
Node-edge lists
Adjacency matrix (straightforward)
Adjacency matrix (sparse)
From-To matrix
They are equivalent, but may be hugely different in performance and convenience for different applications.
Can coerce between the representations
Graph representation
> ft
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4
> ftM2adjM(ft)
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1
> ftM2graphNEL(ft)
A graphNEL graph with directed edges
Number of Nodes = 4
Number of Edges = 4
Graph representations: from-to matrix
Connected components
cc = connComp(rg)
table(listLen(cc))
 1  2  3  4 15 18
36  7  3  2  1  1

Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)

Depth first search
dfsres = dfs(sg, node = "N14")
nodes(sg)[dfsres$discovered]
 [1] "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"
 [9] "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"
[17] "N48" "N09"
rg
RBGL: interface to Boost Graph Library
dfs(sg, "N14")
bfs(sg, "N14")
depth/breadth first search
connected components
sc = strongComp(g2)
nattrs = makeNodeAttrs(g2, fillcolor="")
for(i in 1:length(sc)) nattrs$fillcolor[sc[[i]]] = myColors[i]
plot(g2, "dot", nodeAttrs=nattrs)
wc = connComp(g2)
Different algorithms for different types of graph
  o all edge weights the same
  o positive edge weights
  o real numbers
…and different settings of the problem
  o single pair
  o single source
  o single destination
  o all pairs
Functions
  bfs
  dijkstra.sp
  sp.between
  johnson.all.pairs.sp
Shortest path algorithms
set.seed(123)
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2, fromNode, toNode)
sp[[1]]$path
 [1] "N43" "N08" "N88"
 [4] "N73" "N50" "N89"
 [7] "N64" "N93" "N32"
[10] "N12" "N81"
sp[[1]]$length
[1] 10
Shortest path
ap = johnson.all.pairs.sp(rg2)
hist(ap)
Shortest path
mst = mstree.kruskal(gr)
[Figure: graph gr]
Minimal spanning tree
Consider graph g with a single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.
> edgeConnectivity(g)
$connectivity
[1] 2

$minDisconSet
$minDisconSet[[1]]
[1] "D" "E"

$minDisconSet[[2]]
[1] "D" "H"
Connectivity
dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.
neato: undirected graphs using ’spring’ models
twopi: radial layout. One node (‘root’) chosen as the center. Remaining nodes on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. Root can be specified or chosen heuristically.
Rgraphviz: layout engines
Combining R graphics and Rgraphviz: custom node drawing functions
Inference questions
Compare two graphs: GraphAT
Identify local features in global graphs: apComplex
Large scale topological features of graphs: RBGL, measurement error effects
Estimating error probabilities: ppiStats
Compare two graphs
GraphAT
Compare two graphs
Observed graph of the literature protein-protein interactions used in Ge et al. (315 edges, 298 nodes)
Cluster graph for the 30 clusters reported in Ge et al. (156,205 edges, 2885 nodes)
(All genes that are not in the list of literature-reported PPIs have been removed from this graph for visualization purposes.)
Comparing two graphs
Nodes – yeast genes
Graph 1 – literature reported protein-protein interactions
Graph 2 – cell cycle gene expression cluster membership
Do the graphs overlap more than random?
Is there anything special about the overlapping edges?
Graph of the reported intracluster edges in Ge et al. (42 edges, 65 nodes)
This graph was derived by intersecting the observed literature graph with the cluster graph.
Graph resulting from random reassignment of 315 edges among 2885 nodes
Note that the structure of this graph is quite different from the observed literature graph.
Intersection of random edge graph with cluster graph
Intersecting Edges
Random Edge algorithm (RE)
Permuting Node Labels algorithm (PN)
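As a rough illustration of the node-label permutation idea, here is a generic base R sketch. It is not the GraphAT implementation: `n_shared` and `perm_node_labels` are hypothetical helpers, the two 6-node graphs are made up, and both adjacency matrices are assumed to be symmetric 0/1 with a zero diagonal.

```r
# Number of edges shared by two undirected graphs on the same node set.
n_shared <- function(A, B) sum(A * B) / 2

# Permute the node labels of A, recount the shared edges, and compare
# the observed count with the permutation distribution.
perm_node_labels <- function(A, B, n_perm = 1000) {
  obs <- n_shared(A, B)
  null <- replicate(n_perm, {
    idx <- sample(nrow(A))
    n_shared(A[idx, idx], B)
  })
  list(observed = obs,
       p_value = (sum(null >= obs) + 1) / (n_perm + 1))
}

# Made-up example: B is a cluster graph of two triangles, A has two edges.
set.seed(1)
B <- matrix(0, 6, 6)
B[1:3, 1:3] <- 1
B[4:6, 4:6] <- 1
diag(B) <- 0
A <- matrix(0, 6, 6)
A[1, 2] <- A[2, 1] <- 1
A[4, 5] <- A[5, 4] <- 1

res <- perm_node_labels(A, B, n_perm = 200)
res$observed
res$p_value
```

Permuting labels keeps the structure of A intact, whereas a random-edge scheme would instead redraw the edges of A while keeping only their number.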
Graph with the same test statistic as the observed graph of intracluster edges reported in Ge et al. – 42 intracluster edges
Other Test Statistics
Questions of Interest
Is it reasonable to condition on the structure of the observed graphs – something like an ancillary/sufficient statistic?
Why is the number of intersecting edges invariant to the node label permutation and random edge reassignment algorithms?
What are the most informative test statistics?
EDA Questions of Interest
Which expression clusters have intersections with which of the literature clusters?
Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?
Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine, or that literature clusters are not cell-cycle regulated?
Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?
Identify local features in global graphs
apComplex
Local Modeling of Global Interactome Data
AP-MS (Affinity Purification - Mass Spectrometry)
Measures Complex Comembership
Gavin, et al. (Nature, 2002 and Nature, 2006)
Ho, et al. (Nature, 2002)
Krogan, et al. (Mol Cell 2004, Nature 2006)
Y2H (Yeast Two Hybrid)
Measures Physical Interactions
Ito, et al. (PNAS, 1998)
Uetz, et al. (Nature, 2000)
AP-MS data:
Using a bait protein, AP-MS technology finds prey proteins that are comembers of at least one complex with the bait.
Y2H data:
Y2H technology finds pairs of physically interacting proteins.
[Figure: AP-MS data (one purification, bait and prey) alongside Y2H data]
We want to estimate the bipartite protein complex membership graph, A:
*Estimation of A requires estimation of K, the number of complexes.
1. Some proteins participate in more than one complex
2. In an AP-MS experiment, some proteins are used as baits and some proteins are only ever found as prey
3. Graph theoretic paradigm to allow for succinct formulation
  • Bipartite graph for complex membership (A)
  • Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
  • AP-MS and Y2H are different technologies that measure different relationships between proteins
4. Statistical paradigm to allow for false positive and false negative observations
Four unique aspects to the algorithm
PP2A
Heterotrimeric complex consisting of:
Tpd3 - regulatory A subunit
Rts1 or Cdc55 - regulatory B subunits
Pph21 or Pph22 - catalytic subunits
Jiang and Broach (1999). EMBO.
1. Some proteins participate in more than one complex
Gavin, et al. (2002): Rgraphviz plot of yTAP C151
Bader & Hogue (2002): Portion of Figure 2: Overlap of the spoke models of TAP and HMS-PCI.
Jansen, et al. (2003): PIT Bayesian Network, LR>600
http://genecensus.org/intint
[Figure: node labels Tpd3, Pph21, Pph22, Cdc55, Cdc10, Cdc11, Myo5]
The apComplex algorithm detects:
Zds1 and Zds2 (known cell-cycle regulators) only exist in complexes with the Cdc55-Pph22 trimer!
3. Graph theoretic paradigm to allow for succinct expression of constructs involved
  • Bipartite graph for complex membership
  • Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
  • AP-MS and Y2H are different technologies that measure different relationships between proteins
3. Graph theoretic paradigm to allow for succinct expression of constructs involved
  • Relationship of complex membership (A) to complex comembership (Y) assayed in an AP-MS experiment (Z)
$$A \;\xrightarrow{\;Y = A \otimes A'\;}\; Y \;\xrightarrow{\;\text{AP-MS data for bait proteins and hit-only proteins}\;}\; Z \;\xrightarrow{\;\text{estimation algorithm}\;}\; (\hat{A}, \hat{Y})$$
In summary…
We start with an initial estimate for A, and then refine that estimate according to a two component probability measure:
$$P(Z \mid A, \mu, \alpha) = L(Z \mid Y = A \otimes A', \mu, \alpha)\; C(Z \mid A, \mu, \alpha)$$
where L is the usual likelihood and C is a regularization/penalty term (no. of complexes)
Large scale topological features of graphs
RBGL, measurement error
[Figure: bait-prey graph on Apl5, Apl6, Apm3, Eno2, Aps3, and Ckb1; legend: untested edges marked "?", tested edges that were not observed marked absent]
6 AP-MS observations, Gavin et al. (2002)
Apl5: Apl5, Apl6, Apm3, Aps3, Ckb1
Apl6: Apl5, Apl6, Apm3, Eno2
Apm3: Apl6, Apm3
Measurement Error
  FPs
  FNs
  Stochastic
  Systematic
Missing Data
  Untested Edges
Suitable Inference
  What can we conclude?
Statistical Modeling
Network data are experimentally obtained and hence should be subjected to the same types of data analysis as other data
Given some model, what can we conclude?
  Likelihood methods
  Global and local feature estimation
Measurement Error
Stochastic
  FPs and FNs made ‘randomly’
Systematic
  FPs and FNs made in some predictable way
  ‘Sticky’ proteins may cause FPs
  Conformationally deformed proteins may cause FNs
  Errant treatment of untested edges as absent will cause systematic FNs
Random graphs vs regular networks and what’s in between?
Small-world connectivity
Watts & Strogatz (1998)
6-degrees-of-separation
$L_{\text{small-world}} \approx L_{\text{random}}$, where $L$ = average path length
$C_{\text{small-world}} \gg C_{\text{random}}$, where $C$ = average clustering coefficient of nodes
If node $n$ has $k_n$ neighbors, then
$$C_n = \frac{\#\,\text{observed edges between neighbors}}{k_n (k_n - 1)/2}$$
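The clustering coefficient can be computed directly from an adjacency matrix. The base R sketch below assumes an undirected 0/1 matrix with a zero diagonal; `clustering_coef` is a hypothetical helper and the 4-node graph is made up.

```r
# Local clustering coefficient of each node of an undirected graph.
clustering_coef <- function(A) {
  sapply(seq_len(nrow(A)), function(n) {
    nb <- which(A[n, ] == 1)               # neighbors of node n
    k <- length(nb)
    if (k < 2) return(NA_real_)            # undefined with fewer than 2 neighbors
    edges_between <- sum(A[nb, nb]) / 2    # each undirected edge is counted twice
    edges_between / (k * (k - 1) / 2)
  })
}

# Made-up example: triangle 1-2-3 plus a pendant node 4 attached to node 3.
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 1
A[1, 3] <- A[3, 1] <- 1
A[2, 3] <- A[3, 2] <- 1
A[3, 4] <- A[4, 3] <- 1

cc <- clustering_coef(A)
cc
mean(cc, na.rm = TRUE)   # average clustering coefficient C
```

How to treat nodes with fewer than two neighbors is a choice; here they are simply left out of the average.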
Small-world
Watts & Strogatz (Nature 1998)
Scale-free
Class of small world networks
Node degree distribution follows a power law
In scale-free graphs there are a few highly connected “hubs”
Biology: Relative robustness of network to random perturbation, but huge breakdown to targeted disruption
Simulation Study
Random, Scale-Free, Overlapping Cluster Graphs
50, 500, and 1000 nodes
Approx 5 edges/node
Stochastic FNs: 0.05, 0.15, 0.25
Stochastic FPs: PPV=0.50
Systematic FPs: ‘sticky’ baits detect neighbors of neighbors with p=0.50
L, stochastic FNs
In general, FNs increase L
In the overlapping cluster graph, FNs splinter graph into unconnected components
How to treat L for an unconnected graph?
Misleading results if treated naïvely
C, stochastic FPs (probability an observed edge is true = 0.5)
C, stochastic FPs
If a node has k neighbors, C is the fraction of possible edges between those neighbors that exist
Both numerator and denominator are affected
For cluster graphs, nodes have more neighbors due to FPs, but the number of edges between the neighbors does not increase proportionately
Node degree distribution, systematic FPs
Looked at log-log plot of complementary cumulative distribution function (not frequency distribution)
Based on theoretical work by Li, et al (2006) Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Mathematics.
Assess fit of straight line using R²
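One way to carry out this check in base R is sketched below. The helper `ccdf_r2` is hypothetical and the degree vector is simulated purely for illustration; it is not the simulation design used in the study.

```r
# R^2 of a straight-line fit to the log-log complementary cumulative
# distribution function (CCDF) of node degree.
ccdf_r2 <- function(deg) {
  deg <- deg[deg > 0]                             # log(0) is undefined
  k <- sort(unique(deg))
  ccdf <- sapply(k, function(x) mean(deg >= x))   # P(degree >= k)
  fit <- lm(log(ccdf) ~ log(k))
  summary(fit)$r.squared
}

set.seed(42)
deg <- rgeom(500, prob = 0.2) + 1   # made-up degrees, for illustration only
r2 <- ccdf_r2(deg)
r2
```

A value of R² close to 1 only says the log-log CCDF is nearly linear; it is a descriptive check, not a formal test for a power law.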
Take Home Message from Simulation Study
FPs and FNs do affect statistics on graphs
Even with small amounts of measurement error, the effects can lead to biological misinterpretations
Specifically, scale-free
Estimating error probabilities
ppiData
Systematic Error: Bias in Bait-Prey Systems
[Figure: bait-bait edges among Apl5, Apl6, and Apm3]
Doubly tested bait-bait edges may be
• reciprocated: tested twice, observed twice
• unreciprocated: tested twice, observed once
For a bait subject only to stochastic error, we expect the set of unreciprocated edges to consist of an approximately equal number of in- and out-edges.
If this is not the case, the bait is subject to systematic bias.
[Figure: In-degree vs. Out-degree, unreciprocated edges in bait-induced subgraphs (square root scale); panels: Gavin, 2006 (AP-MS) and Krogan, 2006 (AP-MS)]
Per protein ‘coin-tossing’ model
Quantify departure from symmetry using a binomial distribution with probability parameter p=0.50.
Previous pictures show nodes with p-value<0.01 in dark blue.
Removing these nodes has implications for overall estimates of stochastic FP and FN probabilities.
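The coin-tossing check for a single bait can be sketched with `binom.test` from base R. The counts `n_in` and `n_out` are hypothetical, and the use of a two-sided exact test is an assumption of this sketch rather than something stated on the slide.

```r
# Hypothetical counts of unreciprocated in- and out-edges for one bait.
n_in  <- 13
n_out <- 1

# Under stochastic error only, each unreciprocated edge is equally likely
# to be an in-edge or an out-edge (p = 0.5).
test <- binom.test(n_in, n_in + n_out, p = 0.5)
test$p.value

# Flag the bait as potentially subject to systematic bias.
biased <- test$p.value < 0.01
biased
```

Applying this per bait gives the set of nodes that would be highlighted and possibly removed before estimating the stochastic error rates.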
Global estimation of pTP and pFP

Let $R_T$ and $U_T$ be random variables for the true positive observations. Then, given the true number of edges in the graph, $\Delta_T$,

$$\Pr(R_T = r_T,\, U_T = u_T \mid \Delta_T,\, p_{TP}) = \binom{\Delta_T}{r_T,\; u_T}\, p_{TP}^{2 r_T}\, \bigl(2 p_{TP}(1 - p_{TP})\bigr)^{u_T}\, (1 - p_{TP})^{2(\Delta_T - r_T - u_T)}.$$

Similarly, let $R_F$ and $U_F$ be random variables for the false positive observations. Then, for the total number of potential edges in the graph, $N$,

$$\Pr(R_F = r_F,\, U_F = u_F \mid \Delta_T,\, p_{FP}) = \binom{N - \Delta_T}{r_F,\; u_F}\, p_{FP}^{2 r_F}\, \bigl(2 p_{FP}(1 - p_{FP})\bigr)^{u_F}\, (1 - p_{FP})^{2(N - \Delta_T - r_F - u_F)}.$$

We observe $R = R_T + R_F$ and $U = U_T + U_F$. So we deal with the convolution.
Estimating pTP and pFP
Using the expectations of R and U, we can derive two independent equations with three unknown parameters (pTP, pFP, and ΔT).
For the resulting method-of-moments estimators, any one of the parameters determines the other two, so we can easily derive a one-dimensional solution manifold (with variance bounds).
In practice for AP-MS data, we choose to estimate pTP using a gold standard set of complex co-membership relationships that exist under experimental conditions similar to those used for AP-MS studies.
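For orientation, the two moment equations would plausibly take the following form. This is a reconstruction under the assumption that each of the $\Delta_T$ true edges and each of the $N - \Delta_T$ non-edges is tested twice independently, with per-test detection probabilities $p_{TP}$ and $p_{FP}$; it is not copied from the slides.

```latex
\begin{align}
E[R] &= \Delta_T\, p_{TP}^{2} + (N - \Delta_T)\, p_{FP}^{2}, \\
E[U] &= 2\,\Delta_T\, p_{TP}(1 - p_{TP}) + 2\,(N - \Delta_T)\, p_{FP}(1 - p_{FP}).
\end{align}
```

Setting $E[R]$ and $E[U]$ equal to the observed counts of reciprocated and unreciprocated edges gives two equations in $(p_{TP}, p_{FP}, \Delta_T)$, so fixing any one parameter, for example $p_{TP}$ from a gold standard, pins down the other two.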
[Figure panels: Including systematically biased baits; Without systematically biased baits]
[Figure: bait-prey graph on Apl5, Apl6, Apm3, Eno2, Aps3, and Ckb1; untested edges marked "?", tested edges that were not observed marked absent]
What is an example and do statistics really make a difference?
Bait-induced subgraph:
Degree = number of edges incident on a node
For node n we observe:
IEn = set of in-edges
OEn = set of out-edges
IDn = |IEn| = in-degree
ODn = |OEn| = out-degree
Rn = |IEn ∩ OEn| = ‘reciprocated’ degree
Un = |{IEn ∪ OEn}\{IEn ∩ OEn}| = ‘unreciprocated’ degree
IEApm3 = {Apl5, Apl6}
OEApm3 = {Apl6}
IDApm3 = 2
ODApm3 = 1
RApm3 = 1
UApm3 = 1
Estimating degree under stochastic error only
Current practice: Dn = Rn + Un
DApm3 = 2
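The Apm3 numbers can be reproduced from the bait-to-prey lists shown earlier for Apl5, Apl6, and Apm3. The base R sketch below uses a hypothetical helper, `edge_sets`, and restricts attention to the bait-induced subgraph (edges between baits only).

```r
# Bait -> prey lists as given for the Gavin et al. (2002) example.
apms <- list(
  Apl5 = c("Apl5", "Apl6", "Apm3", "Aps3", "Ckb1"),
  Apl6 = c("Apl5", "Apl6", "Apm3", "Eno2"),
  Apm3 = c("Apl6", "Apm3")
)

# In-edges, out-edges, reciprocated and unreciprocated degree of bait n
# within the bait-induced subgraph.
edge_sets <- function(n, apms) {
  baits <- names(apms)
  others <- setdiff(baits, n)
  # out-edges: other baits that n pulled down as prey
  out <- setdiff(intersect(apms[[n]], baits), n)
  # in-edges: other baits that pulled down n as prey
  inn <- others[vapply(others, function(b) n %in% apms[[b]], logical(1))]
  both <- intersect(inn, out)
  list(IE = inn, OE = out,
       R = length(both),
       U = length(setdiff(union(inn, out), both)))
}

res <- edge_sets("Apm3", apms)
res$R + res$U   # degree under current practice, D = R + U
```

The same call with "Apl5" or "Apl6" gives the corresponding counts for the other two baits.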
Likelihood Approach
$p_{TP}$ = true positive probability
$p_{FP}$ = false positive probability
If $p_{FP} = 0$, then for ‘true’ edges
$p_{T2}$ = p(observing reciprocated edge) = $p_{TP}^{2}$
$p_{T1}$ = p(observing unreciprocated edge) = $2 p_{TP}(1 - p_{TP})$
$p_{T0}$ = p(observing no edge) = $(1 - p_{TP})^{2}$
Given degree $\Delta$,
$$\Pr(R = r,\, U = u \mid \Delta,\, p_{TP}) = \binom{\Delta}{r,\; u}\, p_{T2}^{\,r}\, p_{T1}^{\,u}\, p_{T0}^{\,\Delta - r - u}$$
Write a similar statement for FPs and then maximize the convolution to estimate degree.
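A simplified version of this estimation step is sketched below in base R. It covers only the case without false positives (pFP = 0), so there is no convolution, and it maximizes the multinomial likelihood over candidate degrees by grid search. The helpers `loglik_degree` and `estimate_degree` and the cap `max_delta` are hypothetical choices made for this sketch.

```r
# Log-likelihood of a candidate degree delta, assuming pFP = 0.
loglik_degree <- function(delta, r, u, pTP) {
  if (delta < r + u) return(-Inf)
  probs <- c(pTP^2, 2 * pTP * (1 - pTP), (1 - pTP)^2)   # pT2, pT1, pT0
  dmultinom(c(r, u, delta - r - u), prob = probs, log = TRUE)
}

# Grid search over candidate degrees; max_delta is an arbitrary cap.
estimate_degree <- function(r, u, pTP, max_delta = 50) {
  candidates <- (r + u):max_delta
  ll <- vapply(candidates, loglik_degree, numeric(1),
               r = r, u = u, pTP = pTP)
  candidates[which.max(ll)]
}

# Apm3 has one reciprocated and one unreciprocated edge.
estimate_degree(r = 1, u = 1, pTP = 0.75)
estimate_degree(r = 1, u = 1, pTP = 0.30)
```

With a high pTP the estimate stays at R + U = 2, while with a low pTP the likelihood favors a larger degree, since some true edges were probably never observed.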
Notes
Independence assumption allows product of multinomial probabilities
Only models stochastic error
Only one observation in the likelihood
Truncated binomial for data not subject to FPs was developed by Blumenthal and Dahiya (JASA 1981) and Olkin et al. (JASA 1981).
Summary
Three main applications
(1) Knowledge representation
(2) Exploratory data analysis
(3) Inference
Bioconductor provides a rich set of tools for (1) and (2)…need more of (3)!
Acknowledgements
Robert Gentleman
Wolfgang Huber (thanks for lots of slides)
Vince Carey
Jeff Gentry
Elizabeth Whalen
Seth Falcon