Copyright 2006, Data Mining Research Laboratory
Integrated Mining of PPI Networks: A Case for Ensemble Clustering
Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University
Joint work with Sitaram Asur and Duygu Ucar
Proteins
• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process
Protein Protein Interaction Networks
• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough – (clusters of) interactions need to be delineated as well [v.Mering 2002]
• Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters
Challenges in analyzing PPI Networks
– Noisy data
  • False positives [Deane 2002], false negatives [Hsu 06]
– Existence of Hub Nodes
  • Particularly problematic for standard clustering and graph partitioning algorithms -- lead to very large core clusters and not much else!
– Proteins can be multi-faceted
  • Can belong to multiple functional groups – most clustering algorithms are hard (one cluster per protein) – need for soft or fuzzy clustering
– Data Integration Issues
  • Multiple Sources
    – 2-Hybrid (Y2H), Mass Spectrometry, genetic co-occurrence
  • Different targets
    – Y2H, Mass Spec – target binding
    – Gene co-occurrence – target functional
  • Different weaknesses (missing certain interactions)
    – Y2H – translation
    – mass-spectrometry – transport & sensing
Ensemble Clustering
• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]
• Objective: Mapping between clusters obtained by different algorithms to a single clustering arrangement
• Our hypothesis: Potentially offers a viable solution to these problems simultaneously
  – Given nice theory in the context of classification, it is likely to be particularly useful in a noisy environment.
    • A weak analogy to the audience vote in "Who Wants to Be a Millionaire"
  – Naturally handles arrangements produced from different sources or domain driven segmentation.
Ensemble Clustering on PPI networks: Key Questions
• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems of the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day another time
Birds-eye-view (coarse grained)
[Figure: pipeline diagram. Scale-free graph → topology-based similarity metrics (x) → clustering algorithms (y) → xy base clustering arrangements → cluster representation → (soft) consensus clustering → final clusters]
Similarity Metrics
• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)
Clustering coefficient-based similarity
• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors.
• Clustering coefficient-based similarity of two connected nodes vi and vj
  – Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes
[Figure: example graph with nodes 1 to 6; vi and vj mark the two connected nodes being compared]
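The slide's formula did not survive extraction, so the following is only a minimal sketch of one plausible reading: the edge's contribution is taken to be the drop in the two endpoints' clustering coefficients when the edge is removed. That definition, the function names, and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def cc_edge_similarity(adj, vi, vj):
    """ASSUMED form: CC(vi) + CC(vj) with the edge, minus the same sum without it."""
    before = clustering_coefficient(adj, vi) + clustering_coefficient(adj, vj)
    # Copy the graph and drop the edge (vi, vj).
    pruned = {u: set(n) for u, n in adj.items()}
    pruned[vi].discard(vj)
    pruned[vj].discard(vi)
    after = clustering_coefficient(pruned, vi) + clustering_coefficient(pruned, vj)
    return before - after


# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to 3.
toy_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangle_edge = cc_edge_similarity(toy_adj, 1, 2)  # edge inside the triangle
pendant_edge = cc_edge_similarity(toy_adj, 3, 4)   # edge to the pendant node
```

Under this reading, an edge inside a tightly knit group scores high, while an edge to a loosely attached node scores low or negative, which is the behavior one would want for down-weighting likely false positives.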
Edge betweenness-based similarity
• Shortest path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
  – Edge betweenness-based similarity
[Figure: example graph with nodes 1 to 8]
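As an illustration of shortest-path edge betweenness, here is a minimal sketch for an unweighted, undirected graph using breadth-first search with dependency accumulation (the Brandes approach). It returns raw path counts rather than fractions, and it does not attempt the slide's similarity formula, which is not recoverable from the text. The function name and toy graph are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path dependencies back from the farthest nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair of endpoints was visited from both sides.
    return {edge: value / 2.0 for edge, value in score.items()}


# Hypothetical toy graph: two triangles joined by the bridge 3-4.
bridge_adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
              4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
betweenness = edge_betweenness(bridge_adj)
```

In the toy graph the bridge carries every shortest path between the two triangles, so it receives the highest value, matching the slide's point that edges between communities have high betweenness.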
Neighborhood-based similarity
• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
where Int(i) = number of neighbors of node i
[Figure: example graph with nodes 1 to 6]
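The equation itself is missing from the extracted slide. The sketch below assumes the usual Czekanowski-Dice form, in which Int(i) is the set of node i's neighbors plus i itself and the distance is the size of the symmetric difference divided by the size of the union plus the size of the intersection. Whether the slide used exactly this variant is an assumption, as are the function name and toy graph.

```python
def czekanowski_dice_distance(adj, i, j):
    """ASSUMED form: |Int(i) xor Int(j)| / (|Int(i) or Int(j)| + |Int(i) and Int(j)|)."""
    int_i = adj[i] | {i}  # neighbors of i plus i itself
    int_j = adj[j] | {j}
    differing = len(int_i ^ int_j)
    return differing / (len(int_i | int_j) + len(int_i & int_j))


# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to 3.
toy_graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
same_neighborhood = czekanowski_dice_distance(toy_graph, 1, 2)
different_neighborhood = czekanowski_dice_distance(toy_graph, 1, 4)
```

Nodes that share all their neighbors get distance 0, and the value grows toward 1 as the neighborhoods diverge. Using 1 minus this distance as an edge weight would be one way to obtain a similarity, though that conversion is likewise an assumption.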
Base Clustering
• Base clustering algorithms: Different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based – local, targets FP
  – Edge betweenness-based – global, targets FP
  – Neighborhood – local, potentially targets FN & FP
• 3 x 3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (9K clusters in total)
PCA-based Consensus Technique
Cluster Purification
Dimensionality Reduction
Consensus Clustering
Cluster Purification
• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure
  where SP(i,j) represents shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
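The purification step can be sketched as follows. Because the slide's formula is missing, this assumes the intra-cluster measure is the mean shortest-path length SP(i,j) over all pairs of cluster members, computed in the full graph, and that clusters above a chosen cutoff are pruned. The function names, the cutoff parameter, and the toy path graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Hop counts from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def intra_cluster_distance(adj, cluster):
    """ASSUMED measure: mean SP(i, j) over all pairs of cluster members."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    total = 0.0
    for idx, i in enumerate(members):
        dist = shortest_path_lengths(adj, i)
        for j in members[idx + 1:]:
            # Unreachable pairs count as infinitely far apart.
            total += dist.get(j, float("inf"))
    pairs = len(members) * (len(members) - 1) / 2
    return total / pairs


def purify(adj, clusters, max_distance):
    """Keep only clusters whose members sit close together in the graph."""
    return [c for c in clusters if intra_cluster_distance(adj, c) <= max_distance]


# Hypothetical toy graph: the path 1-2-3-4-5.
path_adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
kept = purify(path_adj, [{1, 2, 3}, {1, 5}], max_distance=2.0)
```

Here the compact cluster {1, 2, 3} survives while {1, 5}, whose members are four hops apart, is removed as unreliable.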
Dimensionality Reduction
• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 X k)
• Clustering inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)
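To make the cluster membership matrix concrete, here is a minimal sketch that builds the binary protein-by-cluster matrix from the pruned base clusters. The row and column layout, the function name, and the protein labels are assumptions, and the logistic PCA step is not implemented here.

```python
def membership_matrix(proteins, base_clusters):
    """One row per protein, one column per base cluster; 1 marks membership."""
    return [[1 if p in cluster else 0 for cluster in base_clusters]
            for p in proteins]


# Hypothetical example: two base arrangements contribute two clusters each.
proteins = ["A", "B", "C", "D"]
base_clusters = [{"A", "B"}, {"C", "D"},   # arrangement 1
                 {"A", "B", "C"}, {"D"}]   # arrangement 2
matrix = membership_matrix(proteins, base_clusters)
```

With 9 arrangements of k clusters each, every protein becomes a binary vector of up to 9k entries, which is the high-dimensional input the slide proposes reducing before consensus clustering.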
Consensus Clustering
• Agglomerative Hierarchical Clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft Clustering Variants
  – Find initial clusters using agglomerative clustering or RBR
  – Assign points to multiple clusters based on similarity
  – Hub nodes have high propensity for multiple membership
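The bottom-up merge and the soft variant can be sketched as below. The slides do not state the linkage rule or the soft-assignment criterion, so this sketch assumes centroid linkage with Euclidean distance on the reduced coordinates, and a hypothetical `slack` factor that admits a point into any cluster whose centroid is nearly as close as its nearest one.

```python
import math


def centroid(points, members):
    """Coordinate-wise mean of the listed points."""
    dim = len(points[members[0]])
    return [sum(points[m][d] for m in members) / len(members) for d in range(dim)]


def agglomerative(points, k):
    """Start from singletons and merge the two closest centroids until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(points, clusters[a]),
                              centroid(points, clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def soft_assign(points, clusters, slack):
    """ASSUMED rule: join every cluster within slack x the nearest-centroid distance."""
    centers = [centroid(points, c) for c in clusters]
    soft = [set(c) for c in clusters]
    for i, p in enumerate(points):
        dists = [math.dist(p, c) for c in centers]
        nearest = min(dists)
        for idx, d in enumerate(dists):
            if d <= slack * nearest:
                soft[idx].add(i)
    return soft


# Hypothetical reduced coordinates: two tight groups and one point in between.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (4, 0.5)]
hard = agglomerative(points, 2)
soft = soft_assign(points, hard, slack=2.5)
```

The in-between point (index 4) ends up in both soft clusters, which mimics how a hub protein connected to several groups would receive multiple memberships.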
Ensemble Framework (Detailed View)
[Figure: detailed pipeline. Topological metrics supply weights for a weighted graph → base clustering → base clustering arrangements → cluster purification (pruning) → principal component analysis → consensus clustering (agglomerative clustering: PCA-agglo; recursive bisection: PCA-rbr; soft: PCA-soft-variants) → final clusters]
Validation Metrics: Domain Independent
• Topological measure: Modularity [Newman&Girvan04]
  – Measures the modularity within clusters
  – dij represents fraction of edges linking nodes in clusters i and j
• Information theoretic measure: Normalized Mutual Information [Strehl & Ghosh03]
  – Measures the shared information between the consensus and base clustering arrangements
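Both measures can be sketched in a few lines. The modularity function assumes the standard Newman-Girvan form, Q = sum over clusters i of (d_ii - a_i^2), where a_i is the share of edge endpoints falling in cluster i. The NMI function assumes the Strehl and Ghosh normalization by the geometric mean of the two entropies. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def modularity(edges, label):
    """Q = sum_i (d_ii - a_i^2) for an undirected edge list and a node -> cluster map."""
    m = len(edges)
    within = Counter()     # d_ii: fraction of edges inside cluster i
    end_share = Counter()  # a_i: fraction of edge endpoints in cluster i
    for u, v in edges:
        if label[u] == label[v]:
            within[label[u]] += 1.0 / m
        end_share[label[u]] += 0.5 / m
        end_share[label[v]] += 0.5 / m
    return sum(within[c] - end_share[c] ** 2 for c in end_share)


def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(a) * H(b)), using natural logs."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster arrangement carries no information
    return mutual / math.sqrt(h_a * h_b)


# Hypothetical toy graph: two triangles joined by one bridge edge.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
label = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
q = modularity(edges, label)
identical = nmi([0, 0, 1, 1], [1, 1, 0, 0])    # same grouping, renamed labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # independent groupings
```

To score a consensus arrangement against an ensemble, one option is to average `nmi` between the consensus labels and each base arrangement's labels.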
Validation Metric: Domain Dependent
• Domain-based measure:
  – Gene ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering Score to measure overall clustering arrangement
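The slide does not say which test produces the p-value. A common choice for annotation enrichment is the hypergeometric tail probability, and the sketch below assumes that choice: the chance of seeing at least `hits` annotated proteins in a cluster drawn at random. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, hits):
    """ASSUMED hypergeometric tail: P(at least `hits` annotated proteins in the cluster)."""
    upper = min(cluster_size, annotated)
    total = comb(population, cluster_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, cluster_size - i)
               for i in range(hits, upper + 1))
    return tail / total


# Hypothetical numbers: 10 proteins, 4 carry the annotation, and a cluster of 3
# contains 3 annotated proteins.
p_enriched = enrichment_p_value(population=10, annotated=4, cluster_size=3, hits=3)
```

A small value means the shared annotation is unlikely to arise from a random grouping, which is how the slide reads smaller p-values as higher biological significance.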
Experimental Setup
• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal Hyperedge Separator using HMetis
  – Meta-Clustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions
Modularity and NMI
CSPA algorithm ran out of memory
CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively)
PCA-based consensus methods provide best scores!

Algorithm    Modularity    NMI
PCA-agglo    0.471         0.66
PCA-rbr      0.46          0.656
MCLA         0.41          0.614
HGPA         0.1           0.275
Comparison with Ensemble Algorithms
[Figure: bar chart "Ensemble Algorithms". Y-axis: Clustering Score (0 to 1). X-axis: CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA, Wt-agglo. Series: Process, Function, Component]
PCA-based Consensus methods outperform all other algorithms!
MCLA performs best of the other algorithms
Existing Solutions to Identify Dense Regions
• Molecular Complex Detection (MCODE)
  – Bader et al, 2003
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
  – Dongen et al 2000
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric
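The random-walk intuition behind MCL can be illustrated with a minimal dense-matrix sketch that alternates expansion (squaring the transition matrix) and inflation (element-wise powering followed by column renormalization). This is not the reference implementation. The default inflation value, the iteration count, the added self-loops, and the way clusters are read off the final matrix are all assumptions.

```python
def mcl(adj, inflation=2.0, iterations=30, eps=1e-6):
    """Toy Markov clustering on an undirected graph given as node -> neighbor set."""
    nodes = sorted(adj)
    pos = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)

    def normalize(mat):
        # Rescale every column so that it sums to 1 (column-stochastic).
        for c in range(n):
            total = sum(mat[r][c] for r in range(n))
            for r in range(n):
                mat[r][c] /= total
        return mat

    # Transition matrix with self-loops: column v spreads flow over v's neighborhood.
    flow = [[0.0] * n for _ in range(n)]
    for v in nodes:
        for w in adj[v] | {v}:
            flow[pos[w]][pos[v]] = 1.0
    flow = normalize(flow)

    for _ in range(iterations):
        # Expansion: two-step random walks.
        flow = [[sum(flow[r][k] * flow[k][c] for k in range(n)) for c in range(n)]
                for r in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        flow = normalize([[x ** inflation for x in row] for row in flow])

    # Each row that still holds flow defines a cluster of the columns it attracts.
    clusters = []
    for r in range(n):
        members = {nodes[c] for c in range(n) if flow[r][c] > eps}
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Hypothetical toy graph: two triangles with no edge between them.
two_triangles = {1: {2, 3}, 2: {1, 3}, 3: {1, 2},
                 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
found = mcl(two_triangles)
```

Flow never crosses between the two components, so each triangle comes out as its own cluster. On connected graphs the inflation step plays the same role gradually, starving the few paths that link natural clusters.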
Comparison with MCODE and MCL
• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)

Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372
Soft Clustering: Comparison with Hub Duplication (Ucar 2006)
[Figure: hub duplication flowchart. Labels: for each hub Hi (i++), hub-induced subgraph Si, dense components of Si, duplicate Hi (D’i), graph partitioning]
A closer look at soft clustering performance
• CKA1 (hub protein)

Base Algorithm    Annotation
Direct-bet        Kinase CK2 complex
Direct-cc         rRNA metabolism
RBR-bet           Kinase CK2 complex
RBR-cc            Kinase CK2 complex
Metis-bet         Cell organization and biogenesis
Metis-cc

PCA-agglo         Kinase CK2 complex
PCA-softagglo     Kinase CK2 complex; rRNA metabolism; Cell organization and biogenesis
Concluding Remarks
• Clustering PPI networks is challenging
  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable -- straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data Integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies
Questions?
• We acknowledge the following grants for support
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate Student Colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/