Load Balancing Map-Reduce Communications for Efficient Executions of Applications in a Cloud
A Project Report
Submitted in partial fulfilment of the
requirements for the Degree of
Master of Technology
in
Computational Science
by
Sharat Chandra Racha
Supercomputer Education and Research Centre
Indian Institute of Science
BANGALORE – 560 012
JULY 2012
© Sharat Chandra Racha
JULY 2012
All rights reserved
Acknowledgements
I would like to take this opportunity to express my deepest sense of gratitude and profound feeling of admiration to my thesis supervisor Dr. Sathish Vadhiyar for his patience,
motivation, enthusiasm and immense knowledge. His wise counsel has made my research
experience enriching and rewarding. I would also like to thank Prof. Govindarajan, Dr.
Atanu Mohanty, Dr. Virender Singh, Dr. Shirish Shevade, Dr. Sathish Govindarajan,
and all others who have helped me gain knowledge through the courses that I studied
under them. I would also like to thank Dr. Vijay Natrajan who has helped me in my
project by giving invaluable suggestions during the midterm evaluations. I would like
to thank the SERC department for providing me with the various computing facilities
during coursework and project work.
I would like to acknowledge my colleague Manogna who has provided me with the
much needed technical and emotional support during my stay. I take this opportunity
to thank Rajath, Sameer, Santanu, Cijo, Preeti, Vasudevan and Hari for their help
and support throughout the year. Thanks are also due to my friends Vinay, Abhishek
and Praveen for their constant support, understanding and encouragement. Finally, I
thank my batchmates, friends and juniors for making life in IISc a happy and fulfilling
experience.
I am indebted to my parents and my sister, Bhavana, for the constant support and
encouragement they gave me while pursuing the degree.
Abstract
The project explores the use of the Hadoop MapReduce framework to execute scientific workflows in the cloud. Cloud computing provides massive clusters for efficient large-scale computation and data analysis. MapReduce is a programming model first designed for improving the performance of large batch jobs on cloud computing systems. One of the most important performance bottlenecks in this model is the load imbalance between the reduce tasks: the input of a reduce task is known only after all the map tasks complete execution, yet the roles of the reducers are assigned beforehand, resulting in load imbalance between the reduce tasks. In this project, we use a multiprocessor scheduling algorithm to assign the roles of the reducers and to minimize the load imbalance between the reduce tasks, resulting in a reduction in the total execution time. We have obtained results by comparing our algorithm with the default Hadoop algorithm, executing a visualization application, namely the out-of-core mesh simplification algorithm, on a Hadoop cluster consisting of 20 nodes. Our emulation results show that our strategies can result in about a 10% decrease in the total execution time of the application on a Hadoop cluster of up to 1024 nodes.
Contents

Acknowledgements

Abstract

1 Introduction
1.1 Cloud Computing and Hadoop
1.2 Map Reduce
1.3 Hadoop Architecture
1.4 Problem Statement
1.5 Organization

2 Related Work
2.1 Scheduling Workflows
2.2 MapReduce Optimizations
2.3 Load Balancing in MapReduce

3 Methodology
3.1 Phase 1 - Sample Map-Reduce
3.2 Phase 2 - Load Balancing
3.3 Phase 3 - Real Map-Reduce
3.4 Solution to the multiprocessor scheduling problem

4 Application: Mesh Simplification
4.1 First layer of map-reduce
4.2 Second layer of map-reduce

5 Experiments and results
5.1 Experiment Setup
5.2 Results
5.2.1 Results of dataset-1
5.2.2 Results of dataset-2
5.2.3 Results of dataset-3

6 Conclusions and Future Work

Bibliography

List of Figures

1.1 Map Reduce
1.2 MapReduce job
3.1 Proposed algorithm for load balancing Map reduce applications
4.1 Mesh Simplification - Visualization application
5.1 Execution times for dataset-1
5.2 Execution times for dataset-2
5.3 Execution times for dataset-3

List of Tables

5.1 1st layer execution times of dataset-1
5.2 2nd layer execution times of dataset-1
5.3 Total (1st+2nd layer) execution times of dataset-1
5.4 1st layer execution times for dataset-2
5.5 2nd layer execution times for dataset-2
5.6 Total (1st+2nd layer) execution times for dataset-2
5.7 1st layer execution times for dataset-3
5.8 2nd layer execution times for dataset-3
5.9 Total (1st+2nd layer) execution times for dataset-3
Chapter 1
Introduction
A workflow is a specification of a set of tasks and the dependencies between them. It is commonly represented as a directed acyclic graph (DAG), often in XML-based formats. Each node of the graph represents an independent task or application, and each directed edge represents an execution dependency between tasks. For example, there may be a data dependency between two tasks (the output of one task is the input of the other). The dependencies among the tasks give their execution order and the dataflow from one task to another. Some of these tasks may also be parallelizable.
1.1 Cloud Computing and Hadoop
Cloud computing provides easy access to high-performance computing and storage infrastructure through web services. It provides massive scalability, reliability and configurability along with high performance. The cost of running an application on a cloud depends on the computation and storage resources that are consumed. The performance benefits and trade-offs of executing scientific applications in the cloud have been discussed in [1] and [2].

Hadoop [3] is a framework for running map-reduce applications on the cloud. MapReduce [4] is a programming model consisting of two functions, Map and Reduce. The Map function processes a block of input and produces a sequence of (key, value) pairs, while the Reduce function processes the set of values associated with a single key.
1.2 Map Reduce
MapReduce provides an abstraction that relies on two operations:

Map: Given input, emit one or more (key, value) pairs.

Reduce: Process all values of a given key and emit one or more (key, value) pairs.

A MapReduce job is composed of three phases, as shown in Figure 1.1: map, shuffle and reduce. In the map phase, each task processes a single block and emits (key, value) pairs. In the shuffle phase, the system sorts the output of the map phase in parallel, grouping all values associated with a particular key. In the reduce phase, each reducer processes all values associated with a given key and emits one or more new (key, value) pairs.

A typical MapReduce application consists of three functions: the map function, the partition function and the reduce function. The map function operates on a series of (key, value) pairs, processes them and emits output (key, value) pairs. Each output (key, value) pair is allocated to a reducer by the partition function, which takes as input the key and the total number of reducers, and returns the index of the reducer to which the corresponding (key, value) pair should be sent for further processing. The reduce function iterates through the values associated with a unique key and emits the output.
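To make these three functions concrete, the following is a minimal word-count job sketched against the Hadoop 0.20 Java API (the Hadoop version used later in this report). Word count is the standard illustration of the model and is not part of the application studied here; the partition function is left to the framework default discussed in Section 1.3. The mapper emits (word, 1) for every word it sees, and the reducer sums the counts that arrive for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: for every word in the input line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts that arrive for a single word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}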
Figure 1.1: Map Reduce
1.3 Hadoop Architecture
In Hadoop, a single master manages a number of slaves. The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The NameNode holds the filesystem metadata, and the files are broken up and spread over the DataNodes. The JobTracker schedules and manages jobs, while the TaskTracker executes the individual map and reduce tasks. If a machine fails, Hadoop continues to operate the cluster by shifting work to the remaining machines.

The input file, which resides on a distributed filesystem throughout the cluster, is split into even-sized chunks that are replicated for fault tolerance. Hadoop divides each MapReduce job into a set of tasks. Each chunk of input is processed by a map task, which outputs a list of (key, value) pairs. In Hadoop, the shuffle phase occurs as the data is processed by the mapper: during execution, each mapper hashes the key of each (key, value) pair into bins, where each bin is associated with a reducer task, and each mapper writes its output to disk to ensure fault tolerance. Since Hadoop assumes that any mapper is equally likely to produce any key, each reducer may potentially receive data from any mapper. Each intermediate (key, value) pair from a map task is passed to a partitioner, which in turn calls the partition function, as shown in Figure 1.2. The partition function takes the (key, value) pair as input and returns the reducer to which the pair should be sent. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key modulo the number of reducers to determine which partition (and thus which reducer) the record belongs to. The number of partitions is equal to the number of reduce tasks for the job.

Figure 1.2: MapReduce job

The amount of data a reducer receives from each mapper, and hence the total size of the data to be processed by each reduce task, is known only after the map tasks complete execution. This leads to load imbalance, because in the current implementation the reducer roles are fixed before the map tasks start.
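The default partitioner's logic is short enough to reproduce; the following mirrors the essential behaviour of Hadoop's HashPartitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner in the 0.20 line). Masking the hash with Integer.MAX_VALUE clears the sign bit, so that negative hash codes do not produce negative partition indices.

import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop's default partitioning rule: a record's key hash, modulo the
// number of reduce tasks, selects the reducer.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}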
1.4 Problem Statement
While existing strategies have dealt with scheduling workflow executions on grids, the same strategies cannot be used for scheduling MapReduce jobs. This is because the edge weights are predefined in a traditional workflow, so the size of the input to each task is known before the execution of the workflow starts; this is not the case with MapReduce tasks. Devising novel strategies is therefore necessary for scheduling MapReduce-based workflows. Moreover, many scientific applications can be designed as MapReduce applications to make them parallelizable, so effectively improving the execution time of these applications is very useful to the scientific community.

We developed a strategy for deciding the roles of the reducer tasks so as to reduce the load imbalance between the reduce tasks. We first run a sample map-reduce task over a sample of the data to estimate the distribution of keys. After obtaining the distribution of keys, we apply a heuristic used for solving the multiprocessor scheduling problem to build a partitioning that minimizes the load imbalance among the reducers.
We have evaluated the proposed optimization technique using a scientific workflow that consists of multiple MapReduce layers: a real workflow application used in the field of visualization [5], namely a mesh simplification algorithm containing two layers of MapReduce tasks. We have obtained results by comparing our algorithm with the default Hadoop algorithm while executing this application, the out-of-core mesh simplification algorithm, on a Hadoop cluster consisting of 20 nodes. Our emulation results show that our strategies can result in about a 10% decrease in the total execution time of the application on a Hadoop cluster of up to 1024 nodes. Our strategies not only reduce the time taken for the reduce operation in a layer of map-reduce, but also improve the overall execution time of an application involving multiple map-reduce layers.

The first map layer bins each vertex into a regular grid and emits (key, value) pairs for each vertex in each triangle; the key is the grid cell (bin) that contains the vertex and the value is the quadric measure vector. The reduce tasks use the quadric measures of all triangles falling into a bin to compute the representative vertex, and emit the triangle as the key, with the current grid cell and representative vertex as the value. In the second layer, the map task reads the output of the first reduce job, keyed on the triangle index, and re-emits it, and the reduce job emits the final simplified mesh. Improving the makespan of this application would be beneficial in the field of visualization, where it is used.
1.5 Organization
In Chapter 2, we discuss the literature related to our work, the shortcomings of previously proposed strategies, and the reasons why a new strategy is required for the given problem. Chapter 3 describes the methodology proposed to reduce the load imbalance among the reducer tasks; it consists of three phases through which we obtain a partitioning that divides the entire data into evenly balanced partitions, which is then used to execute the application. Chapter 4 describes the visualization application with which we have conducted our experiments: a mesh simplification application that takes a structured triangular mesh as input and emits the simplified mesh as output. Chapter 5 presents the experimental setup and the results obtained for three datasets, along with our observations on all three. Chapter 6 discusses the conclusions of our project and the future work that can extend it.
Chapter 2
Related Work
This chapter presents the literature related to our work: strategies for scheduling workflows, map-reduce optimizations, and load balancing in map-reduce. It also discusses why the previously proposed methods cannot be used for map-reduce applications.
2.1 Scheduling Workflows
Scheduling the application components of a workflow onto a grid is a hard problem that has been studied extensively. The scheduling problem is NP-complete, and therefore most of the literature deals with finding good heuristic solutions.

In Mandal et al. [6], scheduling is done on grids using heuristic scheduling strategies (min-min, max-min and sufferage) that use application component performance models. They proposed a strategy to bind and launch the application onto heterogeneous resources. The workflow scheduler uses performance models to determine the run-time resources needed by an application and to compute a mapping of the different components that minimizes the application makespan. It obtains a better makespan than the other existing strategies, and also obtains optimal load balance across the different grid sites. It is compared with existing grid scheduling strategies such as random scheduling without any performance models and heuristic scheduling with crude performance models.
Bi-criteria scheduling proposed in [7] discusses a new algorithm called Dynamic Con-
straint Algorithm (DCA) to address the optimization problem of scheduling workflows
in grids with two independent criteria. One is chosen to be the primary criterion and a
sliding constraint is established to determine how much the final solution can differ from
the best solution found for the primary criterion. The dynamic constraint algorithm
is based on dynamic programming and the problem is modelled as an extension of the
multiple-choice knapsack problem. It shows relatively lo scheduling times for workflows
of medium size.
Mandal et al. [8] have discussed fault tolerance techniques such as over-provisioning and checkpoint-recovery, combined with the HEFT [9] and DSH [10] scheduling algorithms, to collectively address fault tolerance and scheduling of workflows on grids. Over-provisioning is a fault-tolerance mechanism in which multiple copies of a workflow task (with the same input data-set) are executed in parallel. HEFT is a list-based algorithm and DSH is a duplication-based algorithm. They present a study of the effectiveness of various combinations of these approaches by analyzing their impact on the reliability of the workflow execution and on resource usage under different reliability models and failure prediction accuracies.
Task-based algorithms, which greedily allocate tasks to resources, and workflow-based algorithms, which search for an efficient allocation for the entire workflow, are discussed in [11]. The authors conclude that workflow-based approaches work better for data-intensive applications even when estimates about future tasks are inaccurate.

The basic difference between a traditional workflow and a map-reduce job is that the weights of the edges are defined beforehand in the former and are obtained only at runtime in the latter. Scheduling a traditional workflow can therefore follow heuristics such as min-min, max-min, sufferage or HEFT, since the edge weights are available. In a map-reduce job, the edge weights are known only after all the map tasks are complete, so the same scheduling strategies cannot be used to map map-reduce jobs onto the resources. Moreover, none of these approaches deal with load balancing the communications among the nodes.
2.2 MapReduce Optimizations
An adaptive scheduling algorithm has been proposed in [12] for dynamic, heterogeneous Hadoop systems with the objective of improving mean completion time. It also provides competitive performance under fairness and locality metrics with respect to the current Hadoop scheduling algorithms, fair sharing and FIFO; basic scheduling algorithms like FIFO can cause severe performance degradation, particularly in systems that share data among multiple users. The proposed algorithm is based on cluster scheduling and uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions. This work is therefore applicable to systems whose classes of jobs and their arrival rates remain roughly stable over time.

In [13], Liu et al. consider the reducer placement problem: placing reducers so as to minimize cross-rack traffic. One of the main performance bottlenecks in MapReduce is the all-to-all communication between mappers and reducers, which may increase the job execution time, so reducing cross-rack communication improves job performance. They propose a greedy heuristic which is optimal under the assumptions made, namely that each reduce task has the same input size and that each map task produces equal-sized output for each reduce task. These assumptions, however, rarely hold in practice.
2.3 Load Balancing in MapReduce
In [14], Solomonik et al. present an extension of the histogram sorting method which can be used for large data. Histogram sort finds a set of k-1 splitters to divide the keys into k evenly balanced splits, where a splitter is a key that partitions the global set of keys at a desired location. With r reducers, this technique can be used to obtain r-1 splitters that divide the hash values (whose number is much larger than the number of reducers) and their corresponding frequencies into r evenly balanced loads for the r reducers. It uses iterative guessing to find the splitters; these guesses are referred to as probes, and probe refinement is based on a global histogram calculated by applying the splitters to the actual data. The advantage of this technique is its scalability, but due to a threshold quantity that the algorithm uses, it may leave a slight imbalance in the partitions.
In [15] and [16], a heuristic for solving the multiprocessor scheduling problem is proposed. The LPT (Longest Processing Time) heuristic sorts the jobs by their processing times in decreasing order and then assigns them sequentially to the machine with the earliest end time so far. The multiprocessor tasks considered are assumed to be independent, i.e., no precedence relation exists among them. This algorithm is shown to achieve a makespan of at most (4/3 - 1/(3m)) * OPT, where m is the number of machines. The heuristic can be applied to our load balancing problem through the following analogy: the processing time of a task in multiprocessor scheduling corresponds to the load of a hash value, and the processors correspond to the reducers.
Chapter 3
Methodology
A MapReduce job can be realised as a workflow: it consists of nodes of map tasks and reduce tasks, with dependencies (edges) between the map tasks and the reduce tasks. In general, there may be all-to-all communication between the map and reduce tasks. In the default Hadoop implementation, the roles of the reducers are fixed beforehand, and the reducer assigned to a (key, value) pair is decided by a hash partitioner: a function which hashes the key to a value between 0 and (number of reduce tasks - 1), after which the (key, value) pair is sent to the corresponding reducer. In this process some reducers may receive more (key, value) pairs than others, so there may be an imbalance between the loads on the reducers. We aim to reduce this imbalance by assigning the roles of the reducers using a multiprocessor scheduling algorithm, through which we gain a reduction in the total execution time.

We have developed an algorithm for deciding the roles of the reducer tasks that are spawned so as to minimize the load imbalance between them. It takes the number of reducers to be spawned as input and produces the roles of the reducers. The method can be split into three phases.
Figure 3.1: Proposed algorithm for load balancing Map reduce applications
3.1 Phase 1 - Sample Map-Reduce
In the first phase, we try to find the distribution of the key values in the (key, value) pairs of the map output. Since the range of the key values may be very large, we hash these key values over a certain range, say 0 to k-1. We then construct a histogram of the k hash values and their corresponding frequencies (loads). Each hash value is later assigned to one of the r reducers such that the loads of the reducers are balanced. For this to be possible, the range k should be much greater than r (k ≫ r), because the individual frequencies of the hash values should be less than the average load (total load / number of reducers) of each reducer.

Since constructing a histogram for the whole input data is time consuming and impractical, we do this by running the map-reduce job over a sample of the input data, hence the name sample map-reduce. This sample is chosen randomly and uniformly across the input data; simple random sampling is a technique often used in data analysis to reduce the amount of data to be processed. We have used a sample size of 25% of the total input size in all the experiments conducted. We choose the sample by parsing through the input files and passing records to the mapper according to a uniform probability distribution: for example, we generate a random float between 0 and 1 and process the record only if the float lies between 0.25 and 0.5. In this way we generate a sample of the total input data, over which we construct the histogram of hash values. The histogram constructed is used in the next phase.
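A minimal sketch of such a sample map task follows. The bin count k, the use of the raw record text as a stand-in for the application's real key, and all class names here are assumptions for illustration: the mapper keeps roughly a quarter of the records and emits (hash bin, 1), and a summing reducer (as in word count) then produces the histogram.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sample mapper: selects ~25% of the records uniformly at
// random and emits (hashBin, 1) pairs from which the histogram is built.
public class SampleMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private static final int K = 7942;            // hash range 0 .. k-1 (assumed)
    private static final IntWritable ONE = new IntWritable(1);
    private final Random rng = new Random();

    public void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        float p = rng.nextFloat();                // uniform in [0, 1)
        if (p < 0.25f || p >= 0.5f) {
            return;                               // keep ~25% of the records
        }
        // Stand-in for the key the real map function would emit for this record.
        int bin = (record.toString().hashCode() & Integer.MAX_VALUE) % K;
        context.write(new IntWritable(bin), ONE);
    }
}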
This phase is carried out using all the nodes in the cluster. The final reduce outputs on the slave nodes contain partial histograms, each holding the frequencies of certain hash values. All the partial histogram outputs are then combined at the central manager (the master node) to give the entire histogram.
3.2 Phase 2 - Load Balancing
We use the histogram data obtained in the first phase to assign each hash key to a reducer by applying the multiprocessor scheduling algorithm on the master node, also called the central manager. The method by which a hash key is assigned to a reducer is explained in Section 3.4. After all the hash keys have been assigned, we have r partitions of the hash keys such that the sums of their loads are balanced. We write these partitions to a file called the partition file, which is used as input to the custom partitioner function in the third phase (the real map-reduce task) to decide which (key, value) pair should be processed by which reducer.

The second phase is executed only on the master node (the central manager); the histogram data it requires as input is already available there from the previous phase. The output is the partition file, which is written into the input folders of the data residing on the data nodes.
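A sketch of such a custom partitioner is shown below. The partition file format (one "bin reducer" pair per line), the configuration property names and the direct file access are all assumptions; an actual implementation would typically ship the file to the tasks, for example through Hadoop's distributed cache. Hadoop calls setConf when it instantiates a Configurable partitioner, which is where the lookup table is loaded.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner: routes a key to the reducer chosen for
// its hash bin in phase 2, instead of hashing directly onto the reducers.
public class BalancedPartitioner<K, V> extends Partitioner<K, V>
        implements Configurable {
    private Configuration conf;
    private int[] binToReducer;                   // index: hash bin, value: reducer

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        int bin = (key.hashCode() & Integer.MAX_VALUE) % binToReducer.length;
        return binToReducer[bin];
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        int k = conf.getInt("balance.num.bins", 7942);     // assumed property
        String path = conf.get("balance.partition.file");  // assumed property
        binToReducer = new int[k];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.trim().split("\\s+");    // "bin reducer"
                binToReducer[Integer.parseInt(f[0])] = Integer.parseInt(f[1]);
            }
        } catch (IOException e) {
            throw new RuntimeException("cannot read partition file: " + path, e);
        }
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}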
3.3 Phase 3 - Real Map-Reduce
The third phase consists of executing the real map-reduce application to obtain the desired output. The inputs to the reducers are decided by the custom partitioner, which uses the partitions provided in the partition file. The reducer tasks in the map-reduce application are thus load balanced, which leads to a reduction in the total execution time.

The third phase uses all the nodes in the cluster for the execution. The input files and the partition files are available from the datanodes.

For multi-layer applications, the subsequent layers are executed using the default Hadoop implementation. We nevertheless expect a reduction in the execution time of the successive layers, because each layer processes the output files of the reduce tasks of the previous layer: since those output files are evenly balanced by the load balancing done in that layer, the next layer receives equal file sizes to process, which improves its execution time. The data shuffled between the map and reduce phases of the second layer also decreases, because the files it processes have equal loads. Since we did not take up the reducer placement problem, we have no control over the shuffle bytes or the time taken for the shuffle phase; still, we expect an improvement, because all the intermediate files are evenly balanced and the smaller file sizes may reduce the communication time.
3.4 Solution to the multiprocessor scheduling problem
The Longest Processing Time (LPT) heuristic is used to solve the multiprocessor scheduling problem of partitioning the k hash keys among the r reducers. After the first phase, we have a histogram of the k hash keys with their corresponding frequencies. The hash keys are sorted in decreasing order of frequency and traversed in that order one by one; each hash key is assigned to the reducer which currently has the minimum load, and the load of that reducer is then updated. This continues until every hash key is assigned to some reducer, at which point we have r sets of hash keys, each set standing for a particular reducer. In the actual MapReduce application, the key of each map output pair is hashed over the range 0 to k-1, and the reducer whose set contains that hash value is assigned to process the (key, value) pair. The algorithm is illustrated in Algorithm 1.
Algorithm 1 Heuristic algorithm for the multiprocessor scheduling problem
Require: Histogram of the data over the hash range 0 to k-1; r reducers, each with a load variable
  for i = 1 → r do
    load[i] = 0 /* Load of each reducer is 0 initially */
  end for
  Sort the histogram data in decreasing order of frequency
  /* Traverse the sorted histogram data */
  for i = 1 → k do
    Select the reducer having the minimum load
    Assign the i-th hash key of the histogram data to the selected reducer
    Update the load of the selected reducer
  end for
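A compact, runnable rendering of Algorithm 1 in Java is sketched below, assuming the bin loads come from the phase-1 histogram; the class and method names are ours. A min-heap keyed on the current load makes the select-minimum step O(log r), so the whole assignment costs O(k log k) for the sort plus O(k log r) for the placement.

import java.util.Arrays;
import java.util.PriorityQueue;

// LPT heuristic: assign k hash bins to r reducers, heaviest bin first,
// always placing the next bin on the currently least-loaded reducer.
public final class LptPartitioning {

    public static int[] assign(long[] binLoads, int numReducers) {
        // Sort bin indices by decreasing frequency (load).
        Integer[] bins = new Integer[binLoads.length];
        for (int i = 0; i < bins.length; i++) {
            bins[i] = i;
        }
        Arrays.sort(bins, (a, b) -> Long.compare(binLoads[b], binLoads[a]));

        // Min-heap of {load, reducerId}; poll() yields the least-loaded reducer.
        PriorityQueue<long[]> reducers =
                new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
        for (int r = 0; r < numReducers; r++) {
            reducers.add(new long[] {0L, r});
        }

        int[] assignment = new int[binLoads.length];  // bin -> reducer id
        for (int bin : bins) {
            long[] least = reducers.poll();
            assignment[bin] = (int) least[1];
            least[0] += binLoads[bin];                // update the reducer's load
            reducers.add(least);
        }
        return assignment;
    }
}

The resulting bin-to-reducer assignment is what gets written to the partition file of Section 3.2.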
Chapter 4
Application: Mesh Simplification
The mesh simplification application that we have used in our experiments consists of two map-reduce phases, as shown in Figure 4.1. It takes a structured triangular mesh as input and emits a simplified mesh. This is done by superimposing a grid (a 3-dimensional grid in 3-dimensional space) over the given triangular mesh and finding a representative vertex for each grid cell, considering all the vertices present in that cell. Every grid cell that contains a vertex of the input mesh will contain a vertex of the output mesh, and each grid cell will contain only one representative vertex. The problem can thus be seen as finding the triangles whose vertices span three different grid cells, and finding a representative vertex for each grid cell that contains at least one vertex. A parallel implementation of this problem, called Out-of-Core Simplification (OoCS), is presented by Silva et al. [5]. The time complexity of OoCS is O(n), since it performs only a single scan over the mesh file and keeps all the information regarding the quadrics in main memory. Its extension OoCSx, used in this project, has time complexity O(n log n) because of the need to sort several files.
Figure 4.1: Mesh Simplification- Visualization application
4.1 First layer of map-reduce
The first map phase takes the vertices of a triangle as input and bins each vertex into a grid cell, so that all the vertices of a particular grid cell are sent to the same reducer. The quadric measure vector associated with the contributing triangle is also calculated. Three (key, value) pairs are emitted for each triangle: the key is the grid cell that contains the vertex, and the value consists of the quadric measure vector of the triangle along with the indices of the triangle's three vertices. Since the value carries all three vertex indices, we can later determine the grid cells into which the three vertices of the triangle fall.
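The binning step itself is a simple quantization. The sketch below uses the 1000-cells-per-dimension resolution reported in Section 5.1 and assumes the mesh bounding box is known; the linearized cell index is one possible key encoding, since the report does not fix the exact representation.

// Hypothetical binning of a vertex into the 1000 x 1000 x 1000 grid,
// assuming the mesh bounding box [min, max] per dimension is known.
public final class GridBinning {
    private static final int RES = 1000;          // cells per dimension

    public static long cellId(double x, double y, double z,
                              double[] min, double[] max) {
        long bx = bin(x, min[0], max[0]);
        long by = bin(y, min[1], max[1]);
        long bz = bin(z, min[2], max[2]);
        return (bx * RES + by) * RES + bz;        // index in 0 .. 10^9 - 1
    }

    private static long bin(double v, double lo, double hi) {
        long b = (long) ((v - lo) / (hi - lo) * RES);
        return Math.max(0, Math.min(RES - 1, b)); // clamp boundary vertices
    }
}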
The first reduce phase receives the output of the map phase grouped by key. It uses the quadric measures [17] of all vertices falling into a grid cell to compute the representative vertex of that cell. The calculation of the representative vertex involves solving a 3 x 3 linear system of equations to obtain an optimal vertex position that minimizes the quadric error, i.e., the position that minimizes the sum of the squared volumes of the tetrahedra (the quadric measure) formed with the vertices falling into the cell. If the three vertices of a triangle fall into different grid cells, the reduce phase emits the indexed triangle as the key, and the concatenation of the grid cell and the representative vertex as the value. Considering the total output emitted from the first reduce phase, we then have exactly three (key, value) pairs with the same indexed triangle (i.e., the same key), each with a different representative vertex, since these pairs are emitted only when all three vertices lie in distinct grid cells.
4.2 Second layer of map-reduce
We use the second map-reduce layer to gather all three vertices of a single triangle and emit the triangle of the simplified mesh. The second map phase receives the output of the first reduce phase and re-emits the same (key, value) pairs, keyed on the triangle index. The second reduce phase receives the three (key, value) pairs indexed on the same triangle and emits them as a single triangle. Combining the output across all the second layer reduce tasks yields the simplified mesh.
Chapter 5
Experiments and results
5.1 Experiment Setup
Our experimental setup consists of a cluster with 20 nodes. Each node is a Sun Fire server based on dual-core AMD Opteron 2218 processors clocked at 2.64 GHz, with 4 GB of memory and 250 GB of hard disk space. The nodes run CentOS release 4.3 and are connected by gigabit Ethernet. We use Hadoop 0.20.2 for obtaining all the results.
In general, the number of reducers spawned is equal to the maximum number of machines available for processing. We have emulated the cases of 64, 128, 256, 512 and 1024 reducers on this cluster by taking the maximum time spent by any single reducer as the time for the completion of the reduce phase, and obtaining the total execution time of a single MapReduce layer as the sum of the map phase and reduce phase times. In this way we obtain the execution time the job would have if it were run on a cluster with as many machines as the number of reducers spawned.
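In code, the emulation assumption amounts to the following small computation (a sketch; how the per-reducer times are measured is not detailed in the report):

// Emulated layer time under the assumption above: the map phase time plus
// the time of the slowest of the spawned reducers.
public static long emulatedLayerTime(long mapPhaseTime, long[] reducerTimes) {
    long slowest = 0;
    for (long t : reducerTimes) {
        slowest = Math.max(slowest, t);
    }
    return mapPhaseTime + slowest;
}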
The application with which we evaluate the algorithm is used in the field of visualization: the mesh simplification application, which uses the Out-of-core Simplification (OoCSx) algorithm to produce a simplified mesh in the MapReduce programming model. It takes a dataset in the form of a triangular mesh as input and emits the simplified mesh after the execution of two levels of MapReduce jobs. We take the triangular mesh as input to the sample map-reduce and obtain the histogram of the hash values (Section 3.1). We use this histogram to get the partitions of the hash values (Section 3.2), and we use the obtained partitions to decide the input to the reduce tasks in the first layer map-reduce of the application. After receiving the output from the first layer reduce tasks, the second layer map-reduce is executed using the default Hadoop implementation; 64 map tasks and 64 reduce tasks are spawned in the second layer in all the experiments. The proposed strategy assumes that the range of the keys is much greater than the number of machines available (i.e., the number of reducers that may be spawned). In the visualization application that we have worked on, the key values are the grid cells in 3-dimensional space; we have about 10^9 grid cells, and we conduct experiments with up to 1024 reducers.
We compare the total execution time of the visualization application using our optimized algorithm with the total execution time of the application when run with the default Hadoop implementation. The total execution time in the latter case is the sum of the times taken by the first layer map-reduce and the second layer map-reduce. The total execution time in the former case can be divided into three phases. The first phase is the time taken to execute the sample map-reduce job for obtaining the histogram (Section 3.1). The second phase consists of running the multiprocessor scheduling algorithm to obtain the balanced partitions (Section 3.2). The times for the first and second phases in the optimized case are labelled P1 and P2 respectively in the tables presented in the results section. Typically, the sample map-reduce job (P1) is on the order of seconds and the load balancing phase (P2) is on the order of tens of milliseconds, so we show the cumulative time of both phases (P1+P2). The third phase consists of the execution time of the real map-reduce application, both the first layer and the second layer, using the partitions obtained in the second phase. The time taken for the execution of the first layer of the real map-reduce in the optimized case is labelled P3.
For all our experiments, we used a hash value range of 0 to 7941 (the range k) to construct the histogram. We perform only one iteration of the mesh simplification; in this iteration, we have divided the 3-d space into 1000 parts in each dimension, giving rise to 10^9 grid cells. The experiments are conducted using three different input datasets. Dataset-1 is the Happy Buddha dataset, available for download from [18]. Dataset-2 and dataset-3 are synthetically generated datasets, constructed so that some grid cells are heavily loaded, creating an imbalance when the grid cells are hash partitioned. To generate these datasets, we first take a dataset already available for download, run the map phase of the first layer and collect its output in the form of (key, value) pairs. Each vertex in a triangle emits a (key, value) pair, so three pairs are emitted for each triangle processed. The load that the reduce layer receives depends on the key values: a key value is the grid cell in which a vertex falls, and, as explained in Chapter 4, all the vertices falling into a grid cell are processed by a single reducer. We therefore modify some randomly chosen vertices to fall into certain designated grid cells. The value in a (key, value) pair contains the vertex coordinates and the triangle index of the vertex; we move a vertex into its new grid cell by generating random coordinates within that cell, and then modify the coordinates in the corresponding triangle using the triangle index. In this way we reconstruct triangles from the (key, value) pairs and obtain a new dataset whose map phase output is load imbalanced. Dataset-2 and dataset-3 have been generated with different levels of load imbalance by varying the probability of a vertex falling into the heavily loaded grid cells; dataset-3 has a lower level of imbalance than dataset-2.
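A sketch of the skewing step is given below; the hot-cell list, the probability p and the cell side length are parameters chosen per dataset, and all names here are hypothetical.

import java.util.Random;

// Hypothetical skew injection for dataset-2/3: with probability p, move a
// vertex into a randomly chosen "hot" grid cell by resampling its
// coordinates uniformly inside that cell.
public final class SkewInjector {

    public static double[] maybeSkew(double[] xyz, double[][] hotCellOrigins,
                                     double cellSide, double p, Random rng) {
        if (rng.nextDouble() >= p) {
            return xyz;                           // leave most vertices unchanged
        }
        double[] o = hotCellOrigins[rng.nextInt(hotCellOrigins.length)];
        return new double[] {
            o[0] + rng.nextDouble() * cellSide,   // random point inside the
            o[1] + rng.nextDouble() * cellSide,   // chosen hot cell
            o[2] + rng.nextDouble() * cellSide
        };
    }
}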
5.2 Results
5.2.1 Results of dataset-1
We present the execution times of the visualization application, which consists of two map-reduce layers. Tables 5.1, 5.2 and 5.3 contain the execution times for running the application using dataset-1 (the Happy Buddha dataset from [18]) as input. The size of the dataset is about 500 megabytes. The various break-ups of the total execution time are presented in these tables. Table 5.1 contains the execution times of the first layer of the visualization application in both the optimized (using our proposed technique) and the unoptimized (using the default Hadoop hash partitioner) methods. Table 5.2 contains the execution times of the second map-reduce layer of the application. Table 5.3 contains the total execution times of the application in both the optimized and unoptimized cases, and also the percentage improvement in time of the optimized method over the unoptimized one.
Table 5.1: 1st layer execution times of dataset-1

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               213    138    117    111    117
Dummy map-reduce + load balancing (P1+P2)      50     50     50     50     50
Total time, optimized (P1+P2+P3)              263    188    167    161    167
Total time, unoptimized                       501    327    195    180    126
Table 5.2: 2nd layer execution times of dataset-1

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized            75    100    166    215    650
2nd layer execution time, unoptimized          72    106    175    244    798
Figure 5.1: Execution times for dataset-1
Table 5.3: Total (1st+2nd layer) execution times of dataset-1

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized       338    288    333    376    817
1st+2nd layer execution time, unoptimized     573    433    370    424    924
Percentage improvement                       41.0   33.4   10.0   11.3   11.5
We observe from Table 5.1 that the time taken for the first layer map-reduce alone in the optimized case (P3) is less than that in the unoptimized case for every reducer count. This indicates that our proposed algorithm is able to load balance the reducers, which leads to a reduction in execution time. Due to the additional overhead of (P1+P2), however, the total first layer time in the optimized case exceeds that of the unoptimized case for 1024 reducers. Table 5.2 shows that the load balancing in the first layer also led to a decrease in the times of the second layer map-reduce job. This is because the output files of the first map-reduce layer are balanced, so the mappers and reducers in the second layer receive equal loads to execute; in the unoptimized case the output files are load imbalanced, and some reducers in the second layer take more time to complete. Also, the time taken for shuffling data in the second layer is observed to be less in the optimized case than in the unoptimized case. The total time taken for the application (both layers) improves in all cases. The execution times of the different phases are shown graphically in Figure 5.1. The percentage improvement for this dataset is up to 41%.
5.2.2 Results of dataset-2
Tables 5.4, 5.5 and 5.6 contain the execution times for running the application using dataset-2. The size of the dataset is about 550 megabytes. We observe from Table 5.4 that the total time taken for the execution of the first layer map-reduce in the optimized case is less than that in the unoptimized case, except with 1024 reducers. There is also a considerable reduction in the time taken to complete the second layer map-reduce with 128, 256, 512 and 1024 reducers, as can be observed from Table 5.5. Overall, there is an improvement in the total execution times (both layers combined) in all cases. The execution times are shown graphically in Figure 5.2.
Table 5.4: 1st layer execution times for dataset-2

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               243    249    231    219    219
Dummy map-reduce + load balancing (P1+P2)      38     38     38     38     38
Total time, optimized (P1+P2+P3)              281    287    269    257    257
Total time, unoptimized                       591    382    307    287    251
Table 5.5: 2nd layer execution times for dataset-2

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized           447    420    563    441    733
2nd layer execution time, unoptimized         448    447    587    460    901
Table 5.6: Total (1st+2nd layer) execution times for dataset-2

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized       728    707    852    698    990
1st+2nd layer execution time, unoptimized    1039    829    894    747   1152
Percentage improvement                       29.9   14.7    6.9    7.6   14.0
5.2.3 Results of dataset-3
Tables 5.7, 5.8 and 5.9 contain the execution times for running the application using dataset-3. The size of dataset-3 is about 550 megabytes. Table 5.7 shows that the total time taken for the execution of the first layer map-reduce in the optimized case (P1+P2+P3) is less than that in the unoptimized case for every reducer count. The total times for the execution of both layers, shown in Table 5.9, also show that the optimized algorithm is faster than the unoptimized one. The times are shown graphically in Figure 5.3. We obtain a performance improvement of up to 18.8%.
Figure 5.2: Execution times for dataset-2
Table 5.7: 1st layer execution times for dataset-3

No. of reducers                                64    128    256    512   1024
Map+reduce time, optimized (P3)               366    372    334    330    333
Dummy map-reduce + load balancing (P1+P2)      36     36     36     36     36
Total time, optimized (P1+P2+P3)              402    408    370    366    369
Total time, unoptimized                       562    439    391    376    373
Table 5.8: 2nd layer execution times for dataset-3

No. of reducers                                64    128    256    512   1024
2nd layer execution time, optimized           835    843    982   1033   1490
2nd layer execution time, unoptimized         963   1066   1088   1151   1615
Table 5.9: Total (1st+2nd layer) execution times for dataset-3

No. of reducers                                64    128    256    512   1024
1st+2nd layer execution time, optimized      1237   1251   1352   1399   1859
1st+2nd layer execution time, unoptimized    1525   1505   1479   1527   1988
Percentage improvement                       18.8   16.8    8.5    8.3    6.4
Thus, we are able to obtain an improvement in the total execution times for all three datasets that we have experimented with.
Figure 5.3: Execution times for dataset-3
Chapter 6
Conclusions and Future Work
We conclude that executing many MapReduce applications with the default Hadoop implementation of the hash partitioner may give rise to load imbalance between the reducer tasks, which affects the total execution time; we have identified this as one of the main bottlenecks for the execution time of a MapReduce job. We have therefore devised a technique to load balance the reducers, which leads to a considerable reduction in the total execution time. The percentage reduction in time is about 10%.
This work can be extended by experimenting with different sample sizes for constructing the histogram, weighing any resulting speedup against the extra overhead of processing a larger sample. It can also be extended to obtain a mapping of the reducers to the machines in the cluster that takes advantage of the data locality of the intermediate data produced by the map tasks; this could reduce the intermediate network communication time, further speeding up the MapReduce application.
Bibliography
[1] E. Deelman, G. Singh, M. Livny, B. Berriman and J. Good, The Cost of Doing Science on the Cloud: The Montage Example, International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 1-12 (2008).

[2] D. Kondo, B. Javadi, P. Malecot, F. Cappello and D.P. Anderson, Cost-Benefit Analysis of Cloud Computing versus Desktop Grids, IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 1-12 (2009).

[3] http://hadoop.apache.org/

[4] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 107-113 (2008).

[5] H.T. Vo, J. Bronson, B. Summa, J.L.D. Comba, J. Freire, B. Howe, V. Pascucci and C.T. Silva, Parallel Visualization on Large Clusters using MapReduce, Proceedings of the 2011 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV), to appear (2011).

[6] A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu and L. Johnsson, Scheduling Strategies for Mapping Application Workflows onto the Grid, 14th IEEE International Symposium on High Performance Distributed Computing (HPDC-14), 125-134 (2005).

[7] M. Wieczorek, S. Podlipnig, R. Prodan and T. Fahringer, Bi-criteria Scheduling of Scientific Workflows for the Grid, 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 9-16 (2008).

[8] Y. Zhang, A. Mandal, C. Koelbel and K. Cooper, Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID), 244-251 (2009).

[9] H. Topcuoglu, S. Hariri and M.-Y. Wu, Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing, IEEE Trans. Parallel Distrib. Syst., vol. 13, no. 3, 260-274 (2002).

[10] B. Kruatrachue and T. Lewis, Grain Size Determination for Parallel Processing, IEEE Software, vol. 5, no. 1, 23-32 (1988).

[11] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal and K. Kennedy, Task Scheduling Strategies for Workflow-based Applications in Grids, IEEE International Symposium on Cluster Computing and the Grid, 2, 759-767 (2005).

[12] A. Rasooli and D.G. Down, An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems, CASCON (2011).

[13] L.-Y. Ho, J.-J. Wu and P. Liu, Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework, IEEE International Conference on Cloud Computing (CLOUD), 420-427 (2011).

[14] E. Solomonik and L.V. Kale, Highly Scalable Parallel Sorting, IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 1-12 (2010).

[15] R.L. Graham, Bounds on Multiprocessing Timing Anomalies, SIAM J. Appl. Math., 17, 417-429 (1969).

[16] J.F. Lin and S.J. Chen, Scheduling Algorithm for Nonpreemptive Multiprocessor Tasks, Computers Math. Applic., vol. 28, no. 4, 85-92 (1994).

[17] P. Lindstrom and C.T. Silva, A Memory Insensitive Technique for Large Model Simplification, Proceedings of IEEE Visualization 2001 (VIS '01), 121-126 (2001).

[18] http://graphics.stanford.edu/data/3Dscanrep/