Venkatram Ramanathan. Parallelizing a Co-Clustering Application with a Reduction Based Framework on Multi-Core Clusters. Outline: Motivation; Evolution of Multi-Core Machines and the challenges; Background: MapReduce and FREERIDE; Co-clustering on FREERIDE; Experimental Evaluation; Conclusion.
Parallelizing a Co-Clustering Application with a Reduction
Based Framework on Multi-Core Clusters
Venkatram Ramanathan
1
Outline
Motivation
Evolution of Multi-Core Machines and the challenges
Background: MapReduce and FREERIDE
Co-clustering on FREERIDE
Experimental Evaluation
Conclusion
2
Motivation - Evolution Of Multi-Core Machines
Performance increase: more cores at lower clock frequencies
Cost-effective
Scalability of performance
HPC environments: clusters of multi-cores
3
Challenges
Multi-level parallelism
  Within cores in a node: shared-memory parallelism (Pthreads, OpenMP)
  Across nodes: distributed-memory parallelism (MPI)
Achieving both programmability and performance is the major challenge
4
Challenges
Possible solution
  Use higher-level/restricted APIs
  Reduction-based APIs
Map-Reduce
  Higher-level API
  Program a cluster of multi-cores with one API
  Expressive power considered limited
Expressing computations using reduction-based APIs
5
Background
MapReduce
  Map(in_key, in_value) -> list(out_key, intermediate_value)
  Reduce(out_key, list(intermediate_value)) -> list(out_value)
FREERIDE
  Users explicitly declare a reduction object and update it
  Map and Reduce steps are combined
  Each data element is processed and reduced before the next element is processed
6
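The difference between the two models can be sketched in a few lines. This is a minimal single-process illustration in Python, not FREERIDE's actual C/C++ API: the function names and the dictionary used as a reduction object are hypothetical, chosen only to show where the intermediate pairs appear and where they do not.

```python
from collections import defaultdict


def map_reduce_sum(records):
    """MapReduce style: emit all intermediate pairs, group by key, then reduce."""
    # Map phase: every element produces an intermediate (key, value) pair.
    intermediate = [(key, value) for key, value in records]

    # Shuffle phase: group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: combine each key's list of values.
    return {key: sum(values) for key, values in grouped.items()}


def generalized_reduction_sum(records):
    """FREERIDE style: each element is folded into the reduction object at once."""
    reduction_object = defaultdict(float)  # declared up front by the user
    for key, value in records:
        # Process and reduce this element before moving to the next one;
        # no list of intermediate pairs is ever materialized.
        reduction_object[key] += value
    return dict(reduction_object)


records = [("a", 1.0), ("b", 2.0), ("a", 3.0)]
assert map_reduce_sum(records) == generalized_reduction_sum(records)
```

Both functions compute the same per-key sums; the second avoids storing and grouping intermediate pairs, which is the property the FREERIDE bullets describe.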
MapReduce and FREERIDE: Comparison
7
Co-clustering
Involves simultaneous clustering of rows to row clusters and columns to column clusters
Maximizes mutual information
Uses Kullback-Leibler divergence
$$\mathrm{KL}(p, q) = \sum_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)$$
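As a small worked example of the divergence, the sketch below evaluates KL(p, q) for two discrete distributions given as Python lists. The function name `kl_divergence`, the use of the natural logarithm, and the handling of zero entries are assumptions made for illustration; the slides do not specify them.

```python
import math


def kl_divergence(p, q):
    """Return KL(p, q) = sum_x p(x) * log(p(x) / q(x)), using the natural log."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # by convention, a term with p(x) = 0 contributes 0
        if qx == 0.0:
            return math.inf  # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


# Identical distributions have zero divergence.
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
```

Note that the divergence is not symmetric: KL(p, q) and KL(q, p) generally differ.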
8
Overview of Co-clustering Algorithm – Preprocessing
9
Overview of Co-clustering Algorithm – Iterative Procedure
10
Parallelizing Co-clustering on FREERIDE
Input matrix and its transpose pre-computed
Input matrix and transpose
  Divided into files
  Distributed among nodes
  Each node gets the same amount of row and column data
rowCL and colCL: replicated on all nodes
Initial clustering
  Assigned in round-robin fashion, for consistency across nodes
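A round-robin initial clustering can be sketched as below. This assumes "round robin" means that item i goes to cluster i mod k; the function name is hypothetical. Because the assignment depends only on the global index, every node can compute the same rowCL and colCL without communicating.

```python
def round_robin_clusters(num_items, num_clusters):
    """Assign item i to cluster i mod num_clusters (deterministic on every node)."""
    return [i % num_clusters for i in range(num_items)]


# Example: 8 rows and 6 columns, 4 clusters each (sizes chosen for illustration).
row_cl = round_robin_clusters(8, 4)
col_cl = round_robin_clusters(6, 4)
```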
11
Parallelizing Preprocess Step
In preprocessing, pX and pY are normalized by the total sum
  Normalization must wait until all nodes have processed their data
Each node calculates pX and pY with its local data
Reduction object is updated with the partial sum and the pX and pY values
Accumulated partial sums give the total sum
  pX and pY are then normalized
xnorm and ynorm are calculated in a second iteration, as they need the total sum
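The two-stage structure above (local partial results first, normalization only after accumulation) can be sketched as follows. This is a simplified single-process model, not FREERIDE code: it assumes pX and pY are the row and column sums divided by the total, it derives the column sums from row data rather than from the transposed files, and all function names are hypothetical.

```python
def local_partial(rows, num_cols):
    """Per-node pass: unnormalized row sums, column sums and the partial total."""
    row_sums = [sum(row) for row in rows]
    col_sums = [0.0] * num_cols
    for row in rows:
        for j, value in enumerate(row):
            col_sums[j] += value
    return row_sums, col_sums, sum(row_sums)


def accumulate_and_normalize(partials):
    """Global step: combine the per-node results, then normalize by the total."""
    p_x, total = [], 0.0
    p_y = [0.0] * len(partials[0][1])
    for row_sums, col_sums, partial_total in partials:
        p_x.extend(row_sums)  # each node owns a disjoint block of rows
        for j, value in enumerate(col_sums):
            p_y[j] += value
        total += partial_total
    # Normalization is only possible here, once the total sum is known.
    return [v / total for v in p_x], [v / total for v in p_y]


# Two "nodes", each holding one row of a 2 x 2 matrix.
partials = [local_partial([[1.0, 1.0]], 2), local_partial([[2.0, 4.0]], 2)]
p_x, p_y = accumulate_and_normalize(partials)
```

The same dependency on the total sum is why the slide defers xnorm and ynorm to a second pass.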
12
Parallelizing Preprocess Step
Compressed matrix of size #rowclusters x #colclusters, calculated with local data
  Sum of the values of each row cluster across each column cluster
Final compressed matrix: sum of the local compressed matrices
  Local compressed matrices are updated in the reduction object
  Accumulation produces the final compressed matrix
Cluster centroids calculated
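The compressed-matrix reduction can be sketched as below: each node sums its local entries into a #rowclusters x #colclusters matrix, and the final matrix is the element-wise sum of the local ones. The function names, the explicit `row_ids` argument, and the list-of-lists layout are assumptions made for illustration.

```python
def local_compressed(rows, row_ids, row_cl, col_cl, num_row_clusters, num_col_clusters):
    """Sum a node's local entries into a (row cluster x column cluster) matrix."""
    compressed = [[0.0] * num_col_clusters for _ in range(num_row_clusters)]
    for row_id, row in zip(row_ids, rows):
        r = row_cl[row_id]  # cluster of this (globally indexed) row
        for j, value in enumerate(row):
            compressed[r][col_cl[j]] += value
    return compressed


def accumulate(matrices):
    """Element-wise sum of the local compressed matrices."""
    result = [[0.0] * len(matrices[0][0]) for _ in matrices[0]]
    for matrix in matrices:
        for r, line in enumerate(matrix):
            for c, value in enumerate(line):
                result[r][c] += value
    return result


row_cl, col_cl = [0, 1, 0, 1], [0, 1]
node0 = local_compressed([[1.0, 2.0], [3.0, 4.0]], [0, 1], row_cl, col_cl, 2, 2)
node1 = local_compressed([[5.0, 6.0], [7.0, 8.0]], [2, 3], row_cl, col_cl, 2, 2)
final = accumulate([node0, node1])
```

Because addition is associative and commutative, the local matrices can be combined in any order, which is what makes this step fit a reduction object.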
13
Parallelizing Iterative Procedure
Reassign clustering
  Determined by Kullback-Leibler divergence
  Reduction object updated
Compute compressed matrix
  Update reduction object
Column clustering: similar
Objective function: finalize
Next iteration
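The reassignment step can be sketched as choosing, for each row, the cluster whose centroid distribution is closest in Kullback-Leibler divergence. This is a simplified illustration, not the exact update used in the talk: it assumes each row is normalized to a distribution over columns, that the centroids are given as plain distributions, and that no row sums to zero. All names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p, q) with the natural log; terms with p(x) = 0 contribute 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def reassign_rows(rows, centroids):
    """Return, for each row, the index of the centroid with the smallest KL."""
    new_row_cl = []
    for row in rows:
        row_total = sum(row)  # assumed non-zero
        p = [value / row_total for value in row]
        divergences = [kl_divergence(p, centroid) for centroid in centroids]
        new_row_cl.append(divergences.index(min(divergences)))
    return new_row_cl


centroids = [[0.9, 0.1], [0.1, 0.9]]
new_row_cl = reassign_rows([[9.0, 1.0], [1.0, 9.0]], centroids)
```

Each row's decision depends only on that row and the replicated centroids, so nodes can reassign their local rows independently before the results are merged.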
14
Parallelizing Co-clustering on FREERIDE
15
Parallelizing Iterative Procedure
16
Experimental Results
Algorithm: same for shared-memory, distributed-memory and hybrid parallelization
Experiments conducted on 2 clusters
  env1
    Intel Xeon E5345, quad core
    Clock frequency 2.33 GHz
    Main memory 6 GB
    8 nodes
  env2
    AMD Opteron 8350 CPU
    8 cores
    Main memory 16 GB
    4 nodes
17
Experimental Results
2 datasets
  1 GB dataset
    Matrix dimensions 16k x 16k
  4 GB dataset
    Matrix dimensions 32k x 32k
Datasets and transpose
  Split into 32 files each (row partitioning)
  Distributed among nodes
Number of row and column clusters: 4
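The stated sizes are consistent with dense matrices of 4-byte elements, which the short check below illustrates. The element width and the reading of "k" as 1024 are assumptions (the slides state neither), and the helper names are hypothetical.

```python
BYTES_PER_ELEMENT = 4  # assumption: 4-byte values; not stated in the slides
K = 1024               # assumption: "16k" means 16 * 1024
GIB = 1024 ** 3


def matrix_bytes(n):
    """Size in bytes of a dense n x n matrix."""
    return n * n * BYTES_PER_ELEMENT


def rows_per_file(n, num_files=32):
    """Rows in each file under an even row partitioning."""
    return n // num_files


small, large = 16 * K, 32 * K
print(matrix_bytes(small) / GIB, rows_per_file(small))
print(matrix_bytes(large) / GIB, rows_per_file(large))
```

Under these assumptions the matrices come to 1 GiB and 4 GiB, with 512 and 1024 rows per file respectively.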
18
Experimental Results
19
Experimental Results
20
Experimental Results
21
Experimental Results
Preprocessing stage: bottleneck for the smaller dataset, as it is not compute-intensive
  Speedup with preprocessing: 12.17
  Speedup without preprocessing: 18.75
Preprocessing stage scales well for the larger dataset, which involves more computation
  Speedup is the same with and without preprocessing
  Speedup for larger dataset: 20.7
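One way to read these speedups is as parallel efficiency, that is, speedup divided by the number of cores. The sketch below assumes all three figures were measured on 32 cores; the slides state the core count only for the final speedup in the conclusion, so the efficiencies for the smaller dataset are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup achieved on the given number of cores."""
    return speedup / cores


CORES = 32  # assumption: the same core count applies to all three speedups

reported = {
    "small dataset, with preprocessing": 12.17,
    "small dataset, without preprocessing": 18.75,
    "large dataset": 20.7,
}
for label, speedup in reported.items():
    print(f"{label}: {parallel_efficiency(speedup, CORES):.2f}")
```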
22
Conclusion
FREERIDE offers the following advantages:
  No need to load data into custom file systems
  C/C++ based framework
  Much better performance (based on comparisons for other algorithms)
Co-clustering can be viewed as a generalized reduction
  Implementing it on FREERIDE
  Speedup of 21 on 32 cores
23