Load Balancing Map-Reduce Communications for Efficient Executions of Applications in a Cloud
A Project Report
Submitted in partial fulfilment of the
requirements for the Degree of
Master of Technology
in
Computational Science
by
Sharat Chandra Racha
Supercomputer Education and Research Centre
Indian Institute of Science
BANGALORE – 560 012
JULY 2012
© Sharat Chandra Racha
JULY 2012
All rights reserved
Acknowledgements
I would like to take this opportunity to express my deepest sense of gratitude and profound feeling of admiration to my thesis supervisor Dr. Sathish Vadhiyar for his patience,
motivation, enthusiasm and immense knowledge. His wise counsel has made my research
experience enriching and rewarding. I would also like to thank Prof. Govindarajan, Dr.
Atanu Mohanty, Dr. Virender Singh, Dr. Shirish Shevade, Dr. Sathish Govindarajan,
and all others who have helped me gain knowledge through the courses that I studied
under them. I would also like to thank Dr. Vijay Natrajan who has helped me in my
project by giving invaluable suggestions during the midterm evaluations. I would like
to thank the SERC department for providing me with the various computing facilities
during coursework and project work.
I would like to acknowledge my colleague Manogna who has provided me with the
much needed technical and emotional support during my stay. I take this opportunity
to thank Rajath, Sameer, Santanu, Cijo, Preeti, Vasudevan and Hari for their help
and support throughout the year. Thanks are also due to my friends Vinay, Abhishek
and Praveen for their constant support, understanding and encouragement. Finally, I
thank my batchmates, friends and juniors for making life in IISc a happy and fulfilling
experience.
I am indebted to my parents and my sister, Bhavana, for the constant support and
encouragement they gave me while pursuing the degree.
Abstract
The project explores the use of the Hadoop MapReduce framework to execute scientific workflows in the cloud. Cloud computing provides massive clusters for efficient large-scale computation and data analysis. MapReduce is a programming model first designed for improving the performance of large batch jobs on cloud computing systems. One of the most important performance bottlenecks in this model is the load imbalance between the reduce tasks: the input of a reduce task is known only after all the map tasks complete execution, yet the roles of the reducers are assigned beforehand, resulting in load imbalance between the reduce tasks. In this project, we use a multiprocessor scheduling algorithm to assign the roles of the reducers and to minimize the load imbalance between the reduce tasks, resulting in a reduction in the total execution time. We have obtained results by comparing our algorithm with the default Hadoop algorithm, executing a visualization application, namely the out-of-core mesh simplification algorithm, on a Hadoop cluster consisting of 20 nodes. Our emulation results show that our strategies can result in about a 10% decrease in the total execution time of the application on a Hadoop cluster of up to 1024 nodes.
Contents

Acknowledgements

Abstract

1 Introduction
1.1 Cloud Computing and Hadoop
1.2 Map Reduce
1.3 Hadoop Architecture
1.4 Problem Statement
1.5 Organization

2 Related Work
2.1 Scheduling Workflows
2.2 MapReduce Optimizations
2.3 Load Balancing in MapReduce

3 Methodology
3.1 Phase 1 - Sample Map-Reduce
3.2 Phase 2 - Load Balancing
3.3 Phase 3 - Real Map-Reduce
3.4 Solution to the multiprocessor scheduling problem

4 Application: Mesh Simplification
4.1 First layer of map-reduce
4.2 Second layer of map-reduce

5 Experiments and results
5.1 Experiment Setup
5.2 Results
5.2.1 Results of dataset-1
5.2.2 Results of dataset-2
5.2.3 Results of dataset-3

6 Conclusions and Future Work

Bibliography

List of Figures

1.1 Map Reduce
1.2 MapReduce job
3.1 Proposed algorithm for load balancing Map reduce applications
4.1 Mesh Simplification - Visualization application
5.1 Execution times for dataset-1
5.2 Execution times for dataset-2
5.3 Execution times for dataset-3

List of Tables

5.1 1st layer execution times of dataset-1
5.2 2nd layer execution times of dataset-1
5.3 Total (1st+2nd layer) execution times of dataset-1
5.4 1st layer execution times for dataset-2
5.5 2nd layer execution times for dataset-2
5.6 Total (1st+2nd layer) execution times for dataset-2
5.7 1st layer execution times for dataset-3
5.8 2nd layer execution times for dataset-3
5.9 Total (1st+2nd layer) execution times for dataset-3
Chapter 1
Introduction
A workflow is a specification of a set of tasks and the dependencies between them. It is commonly represented as a directed acyclic graph (DAG), often in XML-based formats. Each node of the graph represents an independent task or application, and each directed edge represents an execution dependency between tasks. For example, there may be a data dependency between two tasks (the output of one task is the input of the other). The dependencies among the tasks give their execution order and the dataflow from one task to another. Some of these tasks may also be parallelizable.
1.1 Cloud Computing and Hadoop
Cloud computing provides easy access to high-performance computing and storage infrastructure through web services. It provides massive scalability, reliability and configurability along with high performance. The cost of running an application on a cloud depends on the computation and storage resources that are consumed. The performance benefits and trade-offs of executing scientific applications in the cloud have been discussed in [1] and [2].

Hadoop [3] is a framework for running map-reduce applications on the cloud. MapReduce [4] is a programming model consisting of two functions, Map and Reduce. The Map function processes a block of input and produces a sequence of (key, value) pairs, while the Reduce function processes the set of values associated with a single key.
1.2 Map Reduce
MapReduce provides an abstraction that relies on two operations:

Map: Given input, emit one or more (key, value) pairs.

Reduce: Process all values of a given key and emit one or more (key, value) pairs.

A MapReduce job is composed of three phases, as shown in Figure 1.1: map, shuffle and reduce. In the map phase, each task processes a single block and emits (key, value) pairs. In the shuffle phase, the system sorts the output of the map phase in parallel, grouping all values associated with a particular key. In the reduce phase, each reducer processes all values associated with a given key and emits one or more new (key, value) pairs.

A typical MapReduce application consists of three functions: the map function, the partition function and the reduce function. The map function operates on a series of (key, value) pairs, processes them and emits output (key, value) pairs. Each output (key, value) pair is allocated to a reducer by the partition function, which takes as input the key and the total number of reducers, and returns the index of the reducer to which the corresponding (key, value) pair should be sent for further processing. The reduce function iterates through the values associated with a unique key and emits the output.
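To make these three functions concrete, the following is a minimal word-count job sketched against the Hadoop 0.20 Java API (the Hadoop version used later in this report). Word count is the standard illustration of the model and is not part of the application studied here; the partition function is left to the framework default discussed in Section 1.3. The mapper emits (word, 1) for every word it sees, and the reducer sums the counts that arrive for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word in the input line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts that arrive for a single word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}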
Figure 1.1: Map Reduce
1.3 Hadoop Architecture
In Hadoop, a single master manages a number of slaves. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The NameNode holds the filesystem metadata, and the files are broken up and spread over the DataNodes. The JobTracker schedules and manages jobs, while the TaskTracker executes the individual map and reduce tasks. If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines.

The input file, which resides on a distributed filesystem throughout the cluster, is split into even-sized chunks that are replicated for fault tolerance. Hadoop divides each MapReduce job into a set of tasks. Each chunk of input is processed by a map task, which outputs a list of (key, value) pairs. In Hadoop, the shuffle phase occurs as the data is processed by the mapper: during execution, each mapper hashes the key of each (key, value) pair into bins, where each bin is associated with a reducer task, and each mapper writes its output to disk to ensure fault tolerance. Since Hadoop assumes that any mapper is equally likely to produce any key, each reducer may potentially receive data from any mapper. Each intermediate (key, value) pair from a map task is passed to a partitioner, which in turn calls the partition function, as shown in Figure 1.2. The partition function takes the (key, value) pair as input and returns the reducer to which the pair should be sent. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key modulo the number of reducers to determine which partition (and thus which reducer) the record belongs to. The number of partitions is equal to the number of reduce tasks for the job.

Figure 1.2: MapReduce job

The amount of data a reducer receives from each mapper, and hence the total size of the data to be processed by each reduce task, is known only after the map tasks complete execution. This leads to load imbalance, because in the current implementation the reducer roles are fixed before the map tasks start.
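The default partitioner's logic is short enough to reproduce; the following mirrors the essential behaviour of Hadoop's HashPartitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner in the 0.20 line). Masking the hash with Integer.MAX_VALUE clears the sign bit, so that negative hash codes do not produce negative partition indices.

import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop's default partitioning rule: a record's key hash, modulo the
// number of reduce tasks, selects the reducer.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}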
1.4 Problem Statement
While existing strategies have dealt with scheduling workflow executions on grids, the same strategies cannot be used for scheduling MapReduce jobs. This is because the edge weights are predefined in a traditional workflow, so the size of the input to each task is known before the execution of the workflow starts; this is not the case with MapReduce tasks. Devising novel strategies is therefore necessary for scheduling MapReduce-based workflows. Moreover, many scientific applications can be designed as MapReduce applications to make them parallelizable, so effectively improving the execution time of these applications is very useful to the scientific community.

We developed a strategy for deciding the roles of the reducer tasks so as to reduce the load imbalance between the reduce tasks. We first run a sample map-reduce task over a sample of the data to estimate the distribution of keys. After obtaining the distribution of keys, we apply a heuristic used for solving the multiprocessor scheduling problem to build a partitioning that minimizes the load imbalance among the reducers.
We have evaluated the proposed optimization technique using a scientific workflow that consists of multiple MapReduce layers: a real workflow application used in the field of visualization [5], namely a mesh simplification algorithm containing two layers of MapReduce tasks. We have obtained results by comparing our algorithm with the default Hadoop algorithm while executing this application, the out-of-core mesh simplification algorithm, on a Hadoop cluster consisting of 20 nodes. Our emulation results show that our strategies can result in about a 10% decrease in the total execution time of the application on a Hadoop cluster of up to 1024 nodes. Our strategies not only reduce the time taken for the reduce operation in a layer of map-reduce, but also improve the overall execution time of an application involving multiple map-reduce layers.

The first map layer bins each vertex into a regular grid and emits (key, value) pairs for each vertex in each triangle; the key is the grid cell (bin) that contains the vertex and the value is the quadric measure vector. The reduce tasks use the quadric measures of all triangles falling into a bin to compute the representative vertex, and emit the triangle as the key, with the current grid cell and representative vertex as the value. In the second layer, the map task reads the output of the first reduce job, keyed on the triangle index, and re-emits it, and the reduce job emits the final simplified mesh. Improving the makespan of this application would be beneficial in the field of visualization, where it is used.
1.5 Organization
In Chapter 2, we discuss the literature related to our work, the shortcomings of previously proposed strategies, and the reasons why a new strategy is required for the given problem. Chapter 3 describes the methodology proposed to reduce the load imbalance among the reducer tasks; it consists of three phases through which we obtain a partitioning that divides the entire data into evenly balanced partitions, which is then used to execute the application. Chapter 4 describes the visualization application with which we have conducted our experiments: a mesh simplification application that takes a structured triangular mesh as input and emits the simplified mesh as output. Chapter 5 presents the experimental setup and the results obtained for three datasets, along with our observations on all three. Chapter 6 discusses the conclusions of our project and the future work that can extend it.
Chapter 2
Related Work
This chapter presents the literature related to our work: strategies for scheduling workflows, map-reduce optimizations, and load balancing in map-reduce. It also discusses why the previously proposed methods cannot be used for map-reduce applications.
2.1 Scheduling Workflows
Scheduling the application components of a workflow onto a grid is a hard problem that has been studied extensively. The scheduling problem is NP-complete, and therefore most of the literature deals with finding good heuristic solutions.

In Mandal et al. [6], scheduling is done on grids using heuristic scheduling strategies (min-min, max-min and sufferage) that use application component performance models. They proposed a strategy to bind and launch the application onto heterogeneous resources. The workflow scheduler uses performance models to determine the run-time resources needed by an application and to compute a mapping of the different components that minimizes the application makespan. It obtains a better makespan than the other existing strategies, and also obtains optimal load balance across the different grid sites. It is compared with existing grid scheduling strategies such as random scheduling without any performance models and heuristic scheduling with crude performance models.
Bi-criteria scheduling proposed in [7] discusses a new algorithm called Dynamic Con-
straint Algorithm (DCA) to address the optimization problem of scheduling workflows
in grids with two independent criteria. One is chosen to be the primary criterion and a
sliding constraint is established to determine how much the final solution can differ from
the best solution found for the primary criterion. The dynamic constraint algorithm
is based on dynamic programming and the problem is modelled as an extension of the
multiple-choice knapsack problem. It shows relatively lo scheduling times for workflows
of medium size.
Mandal et al. [8] have discussed fault tolerance techniques such as over-provisioning and checkpoint-recovery, combined with the HEFT [9] and DSH [10] scheduling algorithms, to collectively address fault tolerance and scheduling of workflows on grids. Over-provisioning is a fault-tolerance mechanism in which multiple copies of a workflow task (with the same input data-set) are executed in parallel. HEFT is a list-based algorithm and DSH is a duplication-based algorithm. They present a study of the effectiveness of various combinations of these approaches by analyzing their impact on the reliability of the workflow execution and on resource usage under different reliability models and failure prediction accuracies.
Task-based algorithms, which greedily allocate tasks to resources, and workflow-based algorithms, which search for an efficient allocation for the entire workflow, are discussed in [11]. The authors conclude that workflow-based approaches work better for data-intensive applications even when estimates about future tasks are inaccurate.

The basic difference between a traditional workflow and a map-reduce job is that the weights of the edges are defined beforehand in the former and are obtained only at runtime in the latter. Scheduling a traditional workflow can therefore follow heuristics such as min-min, max-min, sufferage or HEFT, since the edge weights are available. In a map-reduce job, the edge weights are known only after all the map tasks are complete, so the same scheduling strategies cannot be used to map map-reduce jobs onto the resources. Moreover, none of these approaches deal with load balancing the communications among the nodes.
2.2 MapReduce Optimizations
An adaptive scheduling algorithm has been proposed in [12] for dynamic, heterogeneous Hadoop systems with the objective of improving mean completion time. It also provides competitive performance under fairness and locality metrics with respect to the current Hadoop scheduling algorithms, fair sharing and FIFO; basic scheduling algorithms like FIFO can cause severe performance degradation, particularly in systems that share data among multiple users. The proposed algorithm is based on cluster scheduling and uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions. This work is therefore applicable to systems whose classes of jobs and their arrival rates remain roughly stable over time.

In [13], Liu et al. consider the reducer placement problem: placing reducers so as to minimize cross-rack traffic. One of the main performance bottlenecks in MapReduce is the all-to-all communication between mappers and reducers, which may increase the job execution time, so reducing cross-rack communication improves job performance. They propose a greedy heuristic which is optimal under the assumptions made, namely that each reduce task has the same input size and that each map task produces equal-sized output for each reduce task. These assumptions, however, rarely hold in practice.
2.3 Load Balancing in MapReduce
In [14], Solomonik et al. present an extension of the histogram sorting method which can be used for large data. Histogram sort finds a set of k-1 splitters to divide the keys into k evenly balanced splits, where a splitter is a key that partitions the global set of keys at a desired location. With r reducers, this technique can be used to obtain r-1 splitters that divide the hash values (whose number is much larger than the number of reducers) and their corresponding frequencies into r evenly balanced loads for the r reducers. It uses iterative guessing to find the splitters; these guesses are referred to as probes, and probe refinement is based on a global histogram calculated by applying the splitters to the actual data. The advantage of this technique is its scalability, but due to a threshold quantity that the algorithm uses, it may leave a slight imbalance in the partitions.
In [15] and [16], a heuristic for solving the multiprocessor scheduling problem is proposed. The LPT (Longest Processing Time) heuristic sorts the jobs by their processing times in decreasing order and then assigns them sequentially to the machine with the earliest end time so far. The multiprocessor tasks considered are assumed to be independent, i.e., no precedence relation exists among them. This algorithm is shown to achieve a makespan of at most (4/3 - 1/(3m)) * OPT, where m is the number of machines. The heuristic can be applied to our load balancing problem through the following analogy: the processing time of a task in multiprocessor scheduling corresponds to the load of a hash value, and the processors correspond to the reducers.
Chapter 3
Methodology
A MapReduce job can be realised as a workflow: it consists of nodes of map tasks and reduce tasks, with dependencies (edges) between the map tasks and the reduce tasks. In general, there may be all-to-all communication between the map and reduce tasks. In the default Hadoop implementation, the roles of the reducers are fixed beforehand, and the reducer assigned to a (key, value) pair is decided by a hash partitioner: a function which hashes the key to a value between 0 and (number of reduce tasks - 1), after which the (key, value) pair is sent to the corresponding reducer. In this process some reducers may receive more (key, value) pairs than others, so there may be an imbalance between the loads on the reducers. We aim to reduce this imbalance by assigning the roles of the reducers using a multiprocessor scheduling algorithm, through which we gain a reduction in the total execution time.

We have developed an algorithm for deciding the roles of the reducer tasks that are spawned so as to minimize the load imbalance between them. It takes the number of reducers to be spawned as input and produces the roles of the reducers. The method can be split into three phases.
Figure 3.1: Proposed algorithm for load balancing Map reduce applications
3.1 Phase 1 - Sample Map-Reduce
In the first phase, we try to find the distribution of the key values in the (key, value) pairs of the map output. Since the range of the key values may be very large, we hash these key values over a certain range, say 0 to k-1. We then construct a histogram of the k hash values and their corresponding frequencies (loads). Each hash value is later assigned to one of the r reducers such that the loads of the reducers are balanced. For this to be possible, the range k should be much greater than r (k ≫ r), because the individual frequencies of the hash values should be less than the average load (total load / number of reducers) of each reducer.

Since constructing a histogram for the whole input data is time consuming and impractical, we do this by running the map-reduce job over a sample of the input data, hence the name sample map-reduce. This sample is chosen randomly and uniformly across the input data; simple random sampling is a technique often used in data analysis to reduce the amount of data to be processed. We have used a sample size of 25% of the total input size in all the experiments conducted. We choose the sample by parsing through the input files and passing records to the mapper according to a uniform probability distribution: for example, we generate a random float between 0 and 1 and process the record only if the float lies between 0.25 and 0.5. In this way we generate a sample of the total input data, over which we construct the histogram of hash values. The histogram constructed is used in the next phase.
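A minimal sketch of such a sample map task follows. The bin count k, the use of the raw record text as a stand-in for the application's real key, and all class names here are assumptions for illustration: the mapper keeps roughly a quarter of the records and emits (hash bin, 1), and a summing reducer (as in word count) then produces the histogram.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sample mapper: selects ~25% of the records uniformly at
// random and emits (hashBin, 1) pairs from which the histogram is built.
public class SampleMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private static final int K = 7942;            // hash range 0 .. k-1 (assumed)
    private static final IntWritable ONE = new IntWritable(1);
    private final Random rng = new Random();

    public void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        float p = rng.nextFloat();                // uniform in [0, 1)
        if (p < 0.25f || p >= 0.5f) {
            return;                               // keep ~25% of the records
        }
        // Stand-in for the key the real map function would emit for this record.
        int bin = (record.toString().hashCode() & Integer.MAX_VALUE) % K;
        context.write(new IntWritable(bin), ONE);
    }
}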
This phase is carried out using all the nodes in the cluster. The final reduce outputs on the slave nodes contain partial histograms, each holding the frequencies of certain hash values. All the partial histogram outputs are then combined at the central manager (the master node) to give the entire histogram.
3.2 Phase 2 - Load Balancing
We use the histogram data obtained in the first phase to assign each hash key to a reducer by applying the multiprocessor scheduling algorithm on the master node, also called the central manager. The method by which a hash key is assigned to a reducer is explained in Section 3.4. After all the hash keys have been assigned, we have r partitions of the hash keys such that the sums of their loads are balanced. We write these partitions to a file called the partition file, which is used as input to the custom partitioner function in the third phase (the real map-reduce task) to decide which (key, value) pair should be processed by which reducer.

The second phase is executed only on the master node (the central manager); the histogram data it requires as input is already available there from the previous phase. The output is the partition file, which is written into the input folders of the data residing on the data nodes.
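A sketch of such a custom partitioner is shown below. The partition file format (one "bin reducer" pair per line), the configuration property names and the direct file access are all assumptions; an actual implementation would typically ship the file to the tasks, for example through Hadoop's distributed cache. Hadoop calls setConf when it instantiates a Configurable partitioner, which is where the lookup table is loaded.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner: routes a key to the reducer chosen for
// its hash bin in phase 2, instead of hashing directly onto the reducers.
public class BalancedPartitioner<K, V> extends Partitioner<K, V>
        implements Configurable {
    private Configuration conf;
    private int[] binToReducer;                   // index: hash bin, value: reducer

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        int bin = (key.hashCode() & Integer.MAX_VALUE) % binToReducer.length;
        return binToReducer[bin];
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        int k = conf.getInt("balance.num.bins", 7942);     // assumed property
        String path = conf.get("balance.partition.file");  // assumed property
        binToReducer = new int[k];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");    // "bin reducer"
                binToReducer[Integer.parseInt(f[0])] = Integer.parseInt(f[1]);
            }
        } catch (IOException e) {
            throw new RuntimeException("cannot read partition file: " + path, e);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}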
3.3 Phase 3 - Real Map-Reduce
The third phase consists of executing the real map-reduce application to obtain the desired output. The inputs to the reducers are decided by the custom partitioner, which uses the partitions provided in the partition file. The reducer tasks in the map-reduce application are thus load balanced, which leads to a reduction in the total execution time.

The third phase uses all the nodes in the cluster for the execution. The input files and the partition files are available from the datanodes.

For multi-layer applications, the subsequent layers are executed using the default Hadoop implementation. We nevertheless expect a reduction in the execution time of the successive layers, because each layer processes the output files of the reduce tasks of the previous layer: since those output files are evenly balanced by the load balancing done in that layer, the next layer receives equal file sizes to process, which improves its execution time. The data shuffled between the map and reduce phases of the second layer also decreases, because the files it processes have equal loads. Since we did not take up the reducer placement problem, we have no control over the shuffle bytes or the time taken for the shuffle phase; still, we expect an improvement, because all the intermediate files are evenly balanced and the smaller file sizes may reduce the communication time.
3.4 Solution to the multiprocessor scheduling problem
The Longest Processing Time (LPT) heuristic is used to solve the multiprocessor scheduling problem of partitioning the k hash keys among the r reducers. After the first phase, we have a histogram of the k hash keys with their corresponding frequencies. The hash keys are sorted in decreasing order of frequency and traversed in that order one by one; each hash key is assigned to the reducer which currently has the minimum load, and the load of that reducer is then updated. This continues until every hash key is assigned to some reducer, at which point we have r sets of hash keys, each set standing for a particular reducer. In the actual MapReduce application, the key of each map output pair is hashed over the range 0 to k-1, and the reducer whose set contains that hash value is assigned to process the (key, value) pair. The algorithm is illustrated in Algorithm 1.
Algorithm 1 Heuristic algorithm for the multiprocessor scheduling problem
Require: Histogram of the data over the hash range 0 to k-1; r reducers, each with a load variable
  for i = 1 → r do
    load[i] = 0 /* Load of each reducer is 0 initially */
  end for
  Sort the histogram data in decreasing order of frequency
  /* Traverse the sorted histogram data */
  for i = 1 → k do
    Select the reducer having the minimum load
    Assign the i-th hash key of the histogram data to the selected reducer
    Update the load of the selected reducer
  end for
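A compact, runnable rendering of Algorithm 1 in Java is sketched below, assuming the bin loads come from the phase-1 histogram; the class and method names are ours. A min-heap keyed on the current load makes the select-minimum step O(log r), so the whole assignment costs O(k log k) for the sort plus O(k log r) for the placement.

import java.util.Arrays;
import java.util.PriorityQueue;

// LPT heuristic: assign k hash bins to r reducers, heaviest bin first,
// always placing the next bin on the currently least-loaded reducer.
public final class LptPartitioning {

    public static int[] assign(long[] binLoads, int numReducers) {
        // Sort bin indices by decreasing frequency (load).
        Integer[] bins = new Integer[binLoads.length];
        for (int i = 0; i < bins.length; i++) {
            bins[i] = i;
        }
        Arrays.sort(bins, (a, b) -> Long.compare(binLoads[b], binLoads[a]));

        // Min-heap of {load, reducerId}; poll() yields the least-loaded reducer.
        PriorityQueue<long[]> reducers =
                new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
        for (int r = 0; r < numReducers; r++) {
            reducers.add(new long[] {0L, r});
        }

        int[] assignment = new int[binLoads.length];  // bin -> reducer id
        for (int bin : bins) {
            long[] least = reducers.poll();
            assignment[bin] = (int) least[1];
            least[0] += binLoads[bin];                // update the reducer's load
            reducers.add(least);
        }
        return assignment;
    }
}

The resulting bin-to-reducer assignment is what gets written to the partition file of Section 3.2.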
Chapter 4
Application: Mesh Simplification
The mesh simplification application that we have used in our experiments consists of two map-reduce phases, as shown in Figure 4.1. It takes a structured triangular mesh as input and emits a simplified mesh. This is done by superimposing a grid (a 3-dimensional grid in 3-dimensional space) over the given triangular mesh and finding a representative vertex for each grid cell, considering all the vertices present in that cell. Every grid cell that contains a vertex of the input mesh will contain a vertex of the output mesh, and each grid cell will contain only one representative vertex. The problem can thus be seen as finding the triangles whose vertices span three different grid cells, and finding a representative vertex for each grid cell that contains at least one vertex. A parallel implementation of this problem, called Out-of-Core Simplification (OoCS), is presented by Silva et al. [5]. The time complexity of OoCS is O(n), since it performs only a single scan over the mesh file and keeps all the information regarding the quadrics in main memory. Its extension OoCSx, used in this project, has time complexity O(n log n) because of the need to sort several files.
Figure 4.1: Mesh Simplification- Visualization application
4.1 First layer of map-reduce
The first map phase takes the vertices of a triangle as input and bins each vertex into a grid cell, so that all the vertices of a particular grid cell are sent to the same reducer. The quadric measure vector associated with the contributing triangle is also calculated. Three (key, value) pairs are emitted for each triangle: the key is the grid cell that contains the vertex, and the value consists of the quadric measure vector of the triangle along with the indices of the triangle's three vertices. Since the value carries all three vertex indices, we can later determine the grid cells into which the three vertices of the triangle fall.
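The binning step itself is a simple quantization. The sketch below uses the 1000-cells-per-dimension resolution reported in Section 5.1 and assumes the mesh bounding box is known; the linearized cell index is one possible key encoding, since the report does not fix the exact representation.

// Hypothetical binning of a vertex into the 1000 x 1000 x 1000 grid,
// assuming the mesh bounding box [min, max] per dimension is known.
public final class GridBinning {
    private static final int RES = 1000;          // cells per dimension

    public static long cellId(double x, double y, double z,
                              double[] min, double[] max) {
        long bx = bin(x, min[0], max[0]);
        long by = bin(y, min[1], max[1]);
        long bz = bin(z, min[2], max[2]);
        return (bx * RES + by) * RES + bz;        // index in 0 .. 10^9 - 1
    }

    private static long bin(double v, double lo, double hi) {
        long b = (long) ((v - lo) / (hi - lo) * RES);
        return Math.max(0, Math.min(RES - 1, b)); // clamp boundary vertices
    }
}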
The first reduce phase receives the output of the map phase grouped by key. It uses the quadric measures [17] of all vertices falling into a grid cell to compute the representative vertex of that cell. The calculation of the representative vertex involves solving a 3 x 3 linear system of equations to obtain an optimal vertex position that minimizes the quadric error, i.e., the position that minimizes the sum of the squared volumes of the tetrahedra (the quadric measure) formed with the vertices falling into the cell. If the three vertices of a triangle fall into different grid cells, the reduce phase emits the indexed triangle as the key, and the concatenation of the grid cell and the representative vertex as the value. Considering the total output emitted from the first reduce phase, we then have exactly three (key, value) pairs with the same indexed triangle (i.e., the same key), each with a different representative vertex, since these pairs are emitted only when all three vertices lie in distinct grid cells.
4.2 Second layer of map-reduce
We use the second map-reduce layer to gather all three vertices of a single triangle and emit the triangle of the simplified mesh. The second map phase receives the output of the first reduce phase and re-emits the same (key, value) pairs, keyed on the triangle index. The second reduce phase receives the three (key, value) pairs indexed on the same triangle and emits them as a single triangle. Combining the output across all the second layer reduce tasks yields the simplified mesh.
Chapter 5
Experiments and results
5.1 Experiment Setup
Our experimental setup consists of a cluster with 20 nodes. Each node is a Sun Fire server based on dual-core AMD Opteron 2218 processors clocked at 2.64 GHz, with 4 GB of memory and 250 GB of hard disk space. The nodes run CentOS release 4.3 and are connected by gigabit Ethernet. We use Hadoop 0.20.2 for obtaining all the results.
In general, the number of reducers spawned is equal to the maximum number of machines available for processing. We have emulated the cases of 64, 128, 256, 512 and 1024 reducers on this cluster by taking the maximum time spent by any single reducer as the time for the completion of the reduce phase, and obtaining the total execution time of a single MapReduce layer as the sum of the map phase and reduce phase times. In this way we obtain the execution time the job would have if it were run on a cluster with as many machines as the number of reducers spawned.
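In code, the emulation assumption amounts to the following small computation (a sketch; how the per-reducer times are measured is not detailed in the report):

// Emulated layer time under the assumption above: the map phase time plus
// the time of the slowest of the spawned reducers.
public static long emulatedLayerTime(long mapPhaseTime, long[] reducerTimes) {
    long slowest = 0;
    for (long t : reducerTimes) {
        slowest = Math.max(slowest, t);
    }
    return mapPhaseTime + slowest;
}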
The application with which we evaluate the algorithm is used in the field of visualization: the mesh simplification application, which uses the Out-of-core Simplification (OoCSx) algorithm to produce a simplified mesh in the MapReduce programming model. It takes a dataset in the form of a triangular mesh as input and emits the simplified mesh after the execution of two levels of MapReduce jobs. We take the triangular mesh as input to the sample map-reduce and obtain the histogram of the hash values (Section 3.1). We use this histogram to get the partitions of the hash values (Section 3.2), and we use the obtained partitions to decide the input to the reduce tasks in the first layer map-reduce of the application. After receiving the output from the first layer reduce tasks, the second layer map-reduce is executed using the default Hadoop implementation; 64 map tasks and 64 reduce tasks are spawned in the second layer in all the experiments. The proposed strategy assumes that the range of the keys is much greater than the number of machines available (i.e., the number of reducers that may be spawned). In the visualization application that we have worked on, the key values are the grid cells in 3-dimensional space; we have about 10^9 grid cells, and we conduct experiments with up to 1024 reducers.
We compare the total execution time of the visualization application using our optimized algorithm with the total execution time of the application when run with the default Hadoop implementation. The total execution time in the latter case is the sum of the times taken by the first layer map-reduce and the second layer map-reduce. The total execution time in the former case can be divided into three phases. The first phase is the time taken to execute the sample map-reduce job for obtaining the histogram (Section 3.1). The second phase consists of running the multiprocessor scheduling algorithm to obtain the balanced partitions (Section 3.2). The times for the first and second phases in the optimized case are labelled P1 and P2 respectively in the tables presented in the results section. Typically, the sample map-reduce job (P1) is on the order of seconds and the load balancing phase (P2) is on the order of tens of milliseconds, so we show the cumulative time of both phases (P1+P2). The third phase consists of the execution time of the real map-reduce application, both the first layer and the second layer, using the partitions obtained in the second phase. The time taken for the execution of the first layer of the real map-reduce in the optimized case is labelled P3.
For all our experiments, we used a hash value range of 0 to 7941 (the range k) to construct the histogram. We perform only one iteration of the mesh simplification; in this iteration, we have divided the 3-d space into 1000 parts in each dimension, giving rise to 10^9 grid cells. The experiments are conducted using three different input datasets. Dataset-1 is the Happy Buddha dataset, available for download from [18]. Dataset-2 and dataset-3 are synthetically generated datasets, constructed so that some grid cells are heavily loaded, creating an imbalance when the grid cells are hash partitioned. To generate these datasets, we first take a dataset already available for download, run the map phase of the first layer and collect its output in the form of (key, value) pairs. Each vertex in a triangle emits a (key, value) pair, so three pairs are emitted for each triangle processed. The load that the reduce layer receives depends on the key values: a key value is the grid cell in which a vertex falls, and, as explained in Chapter 4, all the vertices falling into a grid cell are processed by a single reducer. We therefore modify some randomly chosen vertices to fall into certain designated grid cells. The value in a (key, value) pair contains the vertex coordinates and the triangle index of the vertex; we move a vertex into its new grid cell by generating random coordinates within that cell, and then modify the coordinates in the corresponding triangle using the triangle index. In this way we reconstruct triangles from the (key, value) pairs and obtain a new dataset whose map phase output is load imbalanced. Dataset-2 and dataset-3 have been generated with different levels of load imbalance by varying the probability of a vertex falling into the heavily loaded grid cells; dataset-3 has a lower level of imbalance than dataset-2.
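A sketch of the skewing step is given below; the hot-cell list, the probability p and the cell side length are parameters chosen per dataset, and all names here are hypothetical.

import java.util.Random;

// Hypothetical skew injection for dataset-2/3: with probability p, move a
// vertex into a randomly chosen "hot" grid cell by resampling its
// coordinates uniformly inside that cell.
public final class SkewInjector {

    public static double[] maybeSkew(double[] xyz, double[][] hotCellOrigins,
                                     double cellSide, double p, Random rng) {
        if (rng.nextDouble() >= p) {
            return xyz;                           // leave most vertices unchanged
        }
        double[] o = hotCellOrigins[rng.nextInt(hotCellOrigins.length)];
        return new double[] {
            o[0] + rng.nextDouble() * cellSide,   // random point inside the
            o[1] + rng.nextDouble() * cellSide,   // chosen hot cell
            o[2] + rng.nextDouble() * cellSide
        };
    }
}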
5.2 Results
5.2.1 Results of dataset-1
We present the execution times of the visualization application, which consists of two map-reduce layers. Tables 5.1, 5.2 and 5.3 contain the execution times for running the application using dataset-1 (the Happy Buddha dataset from [18]) as input. The size of the dataset is about 500 megabytes. The various break-ups of the total execution time are presented in these tables. Table 5.1 contains the execution times of the first layer of the visualization application in both the optimized (using our proposed technique) and the unoptimized (using the default Hadoop hash partitioner) methods. Table 5.2 contains the execution times of the second map-reduce layer of the application. Table 5.3 contains the total execution times of the application in both the optimized and unoptimized cases, and also the percentage improvement in time of the optimized method over the unoptimized one.
Table 5.1: 1st layer execution times of dataset-1

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               213    138    117    111    117
Dummy map-reduce + load balancing (P1+P2)      50     50     50     50     50
Total time, optimized (P1+P2+P3)              263    188    167    161    167
Total time, unoptimized                       501    327    195    180    126
Table 5.2: 2nd layer execution times of dataset-1

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized            75    100    166    215    650
2nd layer execution time, unoptimized          72    106    175    244    798
Figure 5.1: Execution times for dataset-1
Table 5.3: Total (1st+2nd layer) execution times of dataset-1

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized       338    288    333    376    817
1st+2nd layer execution time, unoptimized     573    433    370    424    924
Percentage improvement                       41.0   33.4   10.0   11.3   11.5
We observe from Table 5.1 that the time taken for the first layer map-reduce alone in the optimized case (P3) is less than that in the unoptimized case for every reducer count. This indicates that our proposed algorithm is able to load balance the reducers, which leads to a reduction in execution time. Due to the additional overhead of (P1+P2), however, the total first layer time in the optimized case exceeds that of the unoptimized case for 1024 reducers. Table 5.2 shows that the load balancing in the first layer also led to a decrease in the times of the second layer map-reduce job. This is because the output files of the first map-reduce layer are balanced, so the mappers and reducers in the second layer receive equal loads to execute; in the unoptimized case the output files are load imbalanced, and some reducers in the second layer take more time to complete. Also, the time taken for shuffling data in the second layer is observed to be less in the optimized case than in the unoptimized case. The total time taken for the application (both layers) improves in all cases. The execution times of the different phases are shown graphically in Figure 5.1. The percentage improvement for this dataset is up to 41%.
5.2.2 Results of dataset-2
Tables 5.4, 5.5 and 5.6 contain the execution times for running the application using dataset-2. The size of the dataset is about 550 megabytes. We observe from Table 5.4 that the total time taken for the execution of the first layer map-reduce in the optimized case is less than that in the unoptimized case, except with 1024 reducers. There is also a considerable reduction in the time taken to complete the second layer map-reduce with 128, 256, 512 and 1024 reducers, as can be observed from Table 5.5. Overall, there is an improvement in the total execution times (both layers combined) in all cases. The execution times are shown graphically in Figure 5.2.
Table 5.4: 1st layer execution times for dataset-2

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               243    249    231    219    219
Dummy map-reduce + load balancing (P1+P2)      38     38     38     38     38
Total time, optimized (P1+P2+P3)              281    287    269    257    257
Total time, unoptimized                       591    382    307    287    251
Table 5.5: 2nd layer execution times for dataset-2

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized           447    420    563    441    733
2nd layer execution time, unoptimized         448    447    587    460    901
Table 5.6: Total (1st+2nd layer) execution times for dataset-2

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized       728    707    852    698    990
1st+2nd layer execution time, unoptimized    1039    829    894    747   1152
Percentage improvement                       29.9   14.7    6.9    7.6   14.0
5.2.3 Results of dataset-3
Tables 5.7, 5.8 and 5.9 contain the execution times for running the application using dataset-3. The size of dataset-3 is about 550 megabytes. Table 5.7 shows that the total time taken for the execution of the first layer map-reduce in the optimized case (P1+P2+P3) is less than that in the unoptimized case for every reducer count. The total times for the execution of both layers, shown in Table 5.9, also show that the optimized algorithm is faster than the unoptimized one. The times are shown graphically in Figure 5.3. We obtain a performance improvement of up to 18.8%.
Figure 5.2: Execution times for dataset-2
Table 5.7: 1st layer execution times for dataset-3

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               366    372    334    330    333
Dummy map-reduce + load balancing (P1+P2)      36     36     36     36     36
Total time, optimized (P1+P2+P3)              402    408    370    366    369
Total time, unoptimized                       562    439    391    376    373
Table 5.8: 2nd layer execution times for dataset-3

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized           835    843    982   1033   1490
2nd layer execution time, unoptimized         963   1066   1088   1151   1615
Table 5.9: Total (1st+2nd layer) execution times for dataset-3

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized      1237   1251   1352   1399   1859
1st+2nd layer execution time, unoptimized    1525   1505   1479   1527   1988
Percentage improvement                       18.8   16.8    8.5    8.3    6.4
Thus, we are able to obtain an improvement in the total execution times for all three datasets that we have experimented with.
Figure 5.3: Execution times for dataset-3
Chapter 6
Conclusions and Future Work
We conclude that executing many MapReduce applications with the default Hadoop implementation of the hash partitioner may give rise to load imbalance between the reducer tasks, which affects the total execution time; we have identified this as one of the main bottlenecks for the execution time of a MapReduce job. We have therefore devised a technique to load balance the reducers, which leads to a considerable reduction in the total execution time. The percentage reduction in time is about 10%.
This work can be extended by experimenting with different sample sizes for constructing the histogram, weighing any resulting speedup against the extra overhead of processing a larger sample. It can also be extended to obtain a mapping of the reducers to the machines in the cluster that takes advantage of the data locality of the intermediate data produced by the map tasks; this could reduce the intermediate network communication time, further speeding up the MapReduce application.
Bibliography
[1] E. Deelman, G. Singh, M. Livny, B. Berriman and J. Good, The Cost of Doing Science on the Cloud: The Montage Example, International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1-12 (2008).

[2] D. Kondo, B. Javadi, P. Malecot, F. Cappello and D.P. Anderson, Cost-Benefit Analysis of Cloud Computing versus Desktop Grids, IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 1-12 (2009).

[3] http://hadoop.apache.org/

[4] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 107-113 (2008).

[5] H.T. Vo, J. Bronson, B. Summa, J.L.D. Comba, J. Freire, B. Howe, V. Pascucci and C.T. Silva, Parallel Visualization on Large Clusters using MapReduce, Proceedings of the 2011 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), to appear (2011).

[6] A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu and L. Johnsson, Scheduling Strategies for Mapping Application Workflows onto the Grid, 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), 125-134 (2005).

[7] M. Wieczorek, S. Podlipnig, R. Prodan and T. Fahringer, Bi-criteria Scheduling of Scientific Workflows for the Grid, 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 9-16 (2008).

[8] Y. Zhang, A. Mandal, C. Koelbel and K. Cooper, Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID), 244-251 (2009).

[9] H. Topcuoglu, S. Hariri and M.-Y. Wu, Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing, IEEE Trans. Parallel Distrib. Syst., vol. 13, no. 3, 260-274 (2002).

[10] B. Kruatrachue and T. Lewis, Grain Size Determination for Parallel Processing, IEEE Software, vol. 5, no. 1, 23-32 (1988).

[11] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal and K. Kennedy, Task Scheduling Strategies for Workflow-based Applications in Grids, IEEE International Symposium on Cluster Computing and the Grid, 2, 759-767 (2005).

[12] A. Rasooli and D.G. Down, An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems, CASCON (2011).

[13] L.-Y. Ho, J.-J. Wu and P. Liu, Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework, IEEE International Conference on Cloud Computing (CLOUD), 420-427 (2011).

[14] E. Solomonik and L.V. Kale, Highly Scalable Parallel Sorting, IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 1-12 (2010).

[15] R.L. Graham, Bounds on Multiprocessing Timing Anomalies, SIAM J. Appl. Math., 17, 417-429 (1969).

[16] J.F. Lin and S.J. Chen, Scheduling Algorithm for Nonpreemptive Multiprocessor Tasks, Computers Math. Applic., vol. 28, no. 4, 85-92 (1994).

[17] P. Lindstrom and C.T. Silva, A Memory Insensitive Technique for Large Model Simplification, Proceedings of IEEE Visualization 2001 (VIS '01), 121-126 (2001).

[18] http://graphics.stanford.edu/data/3Dscanrep/