42
A Double Spectral Approach to Disease Module Identification Jake Crawford, Junyuan Lin, Xiaozhe Hu, Benjamin Hescott, Donna Slonim, and Lenore Cowen Tufts University

A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

A Double Spectral Approach to Disease Module Identification

Jake Crawford, Junyuan Lin, Xiaozhe Hu, Benjamin Hescott, Donna Slonim, and Lenore Cowen

Tufts University

Page 2: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Goal

• Identify module structure in biological networks, enriched for a collection of disease gene sets.

Page 3: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

PPI network: issue with defining local neighborhoods: “small-world” network where everything is nearby!

Page 4: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 1

Page 5: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 2

Page 6: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 3

Page 7: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 4

Page 8: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 5

Page 9: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Dream Network 6

Page 10: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Approach

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 11: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Approach

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 12: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our insight: not all short paths give equal indications of similarity

Define He (A,B) to be the expected number of times a k-step random walk starting at A reaches B (including 0-length walk)

k

Page 13: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

DSD: A Spectral Distance Metric

• Hek (A) = (Hek (A, v1), Hek (A, v2), …, Hek (A, vn))

• DSDk (A, B) = | Hek (A) - Hek (B) |1

• We prove DSD is a metric, and converges as k→∞ : call the converged version DSD (A, B) .

Page 14: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

DSD spreads out distances

Page 15: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Approach

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 16: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Speeding Up DSD

For an ergodic network, the DSD distance between two nodes u and v can also be written as

where bi is the ith basis vector, I is the identity matrix, P is the transition matrix of the network, and W is a matrix where each row is a copy of πT = the steady state distribution of the network.

(proof of this in Cao et al. 2013)

Page 17: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Speeding Up DSD

• Computing the matrix inverse (I - P + W)-1 is the most computationally expensive part

• We developed an efficient algorithm to reduce the amortized time for sparse networks, using algebraic multigrid methods

Page 18: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

The method only computes approximate DSD; some artifacts, good enough

Page 19: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Approach

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 20: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Genetic interactions (epistasis)

• For non-essential genes, we can compare the growth of the double knockout to its component single knockouts

Picture: Ulitsky

Page 21: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Finding Bipartite Subgraphs

• “Between Pathway Model” (Kelley and Ideker, 2005)

• Regions having many negative genetic interactions between them (and physical interactions within regions)

• May indicate compensatory biological pathways

• Use Brady et. al 2009 definition; based only on negative genetic interactions (ignore physical interactions)

Page 22: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

What is the Quality of a BPM?

Once we obtain a candidate BPM we can score it using

interaction data.

Sum interactions within

Sum interactions between

Take the difference and normalize to create an

interaction score

-0.664347

0.553838

-7.321556

-6.315511

3.685398

-5.252571

-3.365368

3.236723

-1.366879

2.13473

0.13342

Look for a collection of BPMs in half the networks

Page 23: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Our Approach

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 24: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Subchallenge 1, Step 1: Get the DSD matrix for each network

Page 25: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

What about the directed network?

• We wanted to preserve edge direction, but keep the network (strongly) connected to run DSD in a sensible way

• Add low-weight back edges if none exists already

• Gave better results than treating all edges as undirected

u v

0.8

u v

0.8

u v

0.8

0.001

Page 26: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

We have a distance matrix. Now what?

• Many pre-existing methods to cluster a pairwise distance (or similarity) matrix

• Spectral clustering performed the best in the leaderboard rounds

Page 27: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Spectral Clustering

• Convert distance matrix to similarity matrix using radial basis function (RBF) kernel (high distance -> low similarity and vice-versa)

• Dimension reduction + k-means clustering (used default scikit-learn implementation)

• Tested different values of k in the training rounds

Pedregosa et al, JMLR 12:2825-2830 (2011)

Page 28: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Network 1 - Choosing k

Submission NS (train) NS for different FDR (1%, 2.5%, 10%) NS (test)

k = 1000 14 (8, 10, 22) 16

k = 1200 14 (3, 7, 19) N/A

k = 500, altered weighting scheme 14 (5, 9, 23) N/A

k = 500 12 (3, 7, 20) N/A

Page 29: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Network 5 - Choosing k

Submission NS (train) NS for different FDR (1%, 2.5%, 10%) NS (test)

GC and k=100 overlap 6 (1, 2, 8) N/A

GC and k=200 overlap 5 (3, 3, 7) 5

k=100 5 (1, 3, 6) N/A

Page 30: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

So far we have:

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 31: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

• Small clusters (size < 3): throw out

• Large clusters (size > 100): partition recursively (again using spectral clustering on the DSD matrix), until all clusters are of size < 100

• In practice, splitting large clusters improved most of the results we tested, and only hurt one result (out of ~15 comparisons)

Correcting Cluster Sizes

Page 32: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

• For networks 1, 4, and 6, we’re done.

• For networks 2, 3, and 5, we tried one more trick…looking for BPMs!

Page 33: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Finding Bipartite Subgraphs

• We can identify many candidates by looking only at negative genetic interactions (Brady et al., 2009)

• We suspected there might be bipartite subgraphs in networks 2, 3, and 5

• We used the Genecentric software package to identify bipartite substructure

Page 34: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Clusters from bipartite subgraphs

Clusters from spectral clustering

Single set of final clusters

But, we’re not quite finished

(for networks 2, 3, and 5)

Page 35: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Merging Clusters: Approach 1 (greedy)

1

1 2 3

4 5

2 3 4

5 6

7 8

1 2 3 4 5 7 8

9

9

Spectral clusters Genecentric clusters

Page 36: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Merging Clusters: Approach 2 (overlap)

1 1 2 3

4 5

2 3 4

5 6

7 8

1 2 3 7 8

6 7 8

1

2

3

1

2

3

4 5 6

(1, 1) (1, 1) (1, 1) (2, 2)(1, 2) (2, 3) (3, 3) (3, 3)

9 9

9

(3, 3)

Spectral clusters Genecentric clusters

Page 37: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Looking back:

Cluster the resulting distance matrix

Correct cluster sizesLook for bipartite

subgraphs

Return valid clusters

DSDDiffusion State Distance

Page 38: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Subchallenge 2

• Took union of networks

• For overlapping edges with weights w1, w2, create edge with weight max(w1, w2) + 0.05

• Same DSD + spectral clustering approach as before

• Union of networks 1, 2, and 4 performed the best across FDR levels

Page 39: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Subchallenge 2: Step 0: Superimpose networks 1,2 and 4

Step 1: Get the DSD matrix for the combined network

Page 40: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Future Directions

• Test modifications to edge weights

• Modify other clustering algorithms to run on top of DSD, there are many possible alternatives

• We could also test additional methods of combining edges (or adding networks) for Subchallenge 2

Page 41: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Acknowledgements

Thanks to the Tufts Bioinformatics and Computational Biology group

for helpful ideas and critiques

Page 42: A Double Spectral Approach to Disease Module Identificationjjc2718.github.io/rsg_dream_talk.pdf · Finding Bipartite Subgraphs • “Between Pathway Model” (Kelley and Ideker,

Looking back

Cluster the resulting distance matrix(spectral clustering)

Correct cluster sizes(recursive spectral

clustering)

Look for bipartite subgraphs (Genecentric)

Return valid clusters (putative disease modules)

Define a network distance measure (DSD)