
MapReduce in Cloud Computing

Mohammad Mustaqeem, M.Tech 2nd Year

Computer Science and Engineering, Reg. No: 2011CS17

Department of Computer Science and Engineering

Motilal Nehru National Institute of Technology

Allahabad

Contents

1 Introduction
  1.1 Map and Reduce in Functional Programming
  1.2 Structure of MapReduce Framework
2 Motivations
3 Description of First Paper
  3.1 Issues
  3.2 Approach used to Tackle the Issue
    3.2.1 Hadoop Distributed File System
    3.2.2 MapReduce Programming Model
  3.3 An Example: Word Count
4 Description of Second Paper
  4.1 Issues
  4.2 Approach used to Tackle the Issue
    4.2.1 System Model
    4.2.2 Architecture
    4.2.3 System Mechanism
  4.3 Example
5 Integration of both Papers
6 Conclusion

List of Figures

1 HDFS Architecture
2 Execution phase of a generic MapReduce application
3 Word Count Execution
4 System model described through the UML Class Diagram
5 Behaviour of a generic node described by a UML State Diagram
6 General Architecture of P2P-MapReduce


1 Introduction

Cloud computing is designed to provide on-demand resources or services over the Internet, usually at the scale and with the reliability level of a data center. MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing Web pages.

The model is inspired by the map and reduce functions commonly used in functional programming (e.g. LISP, Scheme, Racket) [3], although their purpose in the MapReduce framework is not the same as in their original forms.

1.1 Map and Reduce in Functional Programming

• Map: The structure of the map function in Racket is:

(map f list1)→ list2 [4]

where f is a function, and list1 and list2 are lists.

It applies the function f to the elements of list1 and gives a list list2 containing the results of f in order.

e.g. (map (lambda (x)(* x x)) ’(1 2 3 4 5))→ ’(1 4 9 16 25)

• Reduce: There are two variations of the reduce function in Racket. Their structures are:

(foldl f init list1) → any

and

(foldr f init list1) → any [4]

Like map, foldl applies a function to the elements of one or more lists. Whereas map combines the return values into a list, foldl combines the return values in an arbitrary way that is determined by f. In foldl, list1 is traversed from left to right, while in foldr, list1 is traversed from right to left.

e.g. (foldl - 0 ’(1 2 3 4 5 6))→ 3

(foldr - 0 ’(1 2 3 4 5 6))→ -3
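For comparison, the same three calls can be reproduced in Python with the built-in map and functools.reduce. This is an illustrative sketch added here, not part of either paper; note that Racket's folds pass (element, accumulator) while functools.reduce passes (accumulator, element), so the lambdas swap the arguments.

from functools import reduce

# map: apply a function to every element, collecting the results in order
print(list(map(lambda x: x * x, [1, 2, 3, 4, 5])))  # [1, 4, 9, 16, 25]

# foldl: traverse the list from left to right
print(reduce(lambda acc, x: x - acc, [1, 2, 3, 4, 5, 6], 0))  # 3

# foldr: the same combining function, but traversing right to left
print(reduce(lambda acc, x: x - acc, list(reversed([1, 2, 3, 4, 5, 6])), 0))  # -3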

1.2 Structure of MapReduce Framework

The framework is divided into two parts:

• Map: It distributes work to the different nodes in the distributed cluster.


• Reduce: It collects the work and resolves the results into a single value.

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note of it and re-assigns the work to other nodes.
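A minimal Python sketch of this heartbeat idea is given below. The class, the 10-second timeout and the reassignment policy are illustrative assumptions, not details of any actual MapReduce implementation.

import time

HEARTBEAT_TIMEOUT = 10.0  # assumed: seconds of silence before a node is presumed dead

class Master:
    def __init__(self):
        self.last_seen = {}    # node_id -> time of the last status report
        self.assignments = {}  # node_id -> list of task ids assigned to that node

    def heartbeat(self, node_id):
        # workers call this periodically with their status updates
        self.last_seen[node_id] = time.time()

    def reassign_from_dead_nodes(self, idle_nodes):
        # any node silent for longer than the timeout loses its tasks
        now = time.time()
        for node_id, seen in list(self.last_seen.items()):
            if now - seen > HEARTBEAT_TIMEOUT and self.assignments.get(node_id):
                if not idle_nodes:
                    break  # no spare workers available; retry on the next sweep
                tasks = self.assignments.pop(node_id)
                target = idle_nodes.pop(0)
                self.assignments.setdefault(target, []).extend(tasks)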

2 Motivations

Computations that process large amounts of raw data, such as crawled documents or web request logs, to compute various kinds of derived data (inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc.) are very complex. Most such computations are conceptually straightforward. However, the input data is usually large, and the computations have to be distributed across hundreds or thousands of machines (a cluster) in order to finish in a reasonable amount of time. Moreover, some machines may fail during the computation. So a solution is required that copes well with these issues.

The MapReduce framework handles these issues: how to parallelize the computation, how to distribute the data, and how to handle failures of nodes during the computation. Besides these features, writing MapReduce programs is very easy: programmers have to define just two functions, map and reduce. The rest of the work is done by the MapReduce framework.

3 Description of First Paper

Gaizhen Yang, ”The Application of MapReduce in the Cloud Computing”

3.1 Issues

In cloud computing, clusters of commodity hardware need to process enormous amounts of data that cannot be handled by a single machine. Real-life examples of such processing are Reverse Web-Link Graph, web access analysis, Term-Vector per Host, inverted index construction, Count of URL Access Frequency, Distributed Sort, etc. [3]. Because of the size of these data sets, we need to process them in parallel, distributed over large clusters of machines, so that the processing can be done in a reasonable amount of time.


3.2 Approach used to Tackle the Issue

Hadoop is an open-source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware (a cloud), and it has been applied at many sites such as Amazon, Facebook and Yahoo [1]. It takes advantage of a distributed system infrastructure to process enormous amounts of data in almost real time. It can also tolerate node failures because it keeps multiple copies of the data.

Hadoop has mainly two components: MapReduce and the Hadoop Distributed File System (HDFS) [1].

3.2.1 Hadoop Distributed File System

HDFS provides the underlying support for distributed storage. As in a traditional file system, we can create, delete and rename files and directories, but these files and directories are stored in a distributed fashion among the nodes. In HDFS, there are two types of nodes: the Name Node and the Data Nodes [1]. The Name Node provides the metadata services, while the Data Nodes provide the actual storage. A Hadoop cluster contains only one Name Node and multiple Data Nodes. In HDFS, files are divided into blocks, which are copied to multiple Data Nodes to provide a reliable file system. The HDFS architecture is shown below.

Figure 1: HDFS Architecture

• Name Node - The Name Node is a process that runs on a separate machine. It provides all the metadata services, that is, file system management and maintenance of the file system tree. In reality, the Name Node stores only the metadata of the files and directories. While programming, a programmer does not need the actual location of a file; the file can be accessed through the Name Node, which does all the underlying work for the user.


• Data Node - A Data Node is a process that runs on the individual machines of the cluster. The file blocks are stored in the local file systems of these nodes. These nodes periodically send the metadata of the stored blocks to the Name Node. Clients can write blocks directly to a Data Node. After writing, deleting or copying blocks, the Data Nodes inform the Name Node.

The sequence of operations to write a file in HDFS is as follows (a small sketch of this flow is given after the list):

1. The client sends a request to write a file to the Name Node.

2. According to the file size and the file block configuration, the Name Node returns the corresponding block and Data Node information to the client.

3. The client divides the file into multiple blocks and, according to the Data Node address information, writes the blocks to the Data Nodes.
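The sketch below imitates this write path on a single machine: the client-side splitting into blocks, and the block-to-Data-Node placement that the Name Node would record as metadata. The 64 MB block size, replication factor of 3 and round-robin placement are simplifying assumptions (real HDFS placement is rack-aware).

import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # assumed 64 MB block size
REPLICATION = 3                # assumed replication factor

def write_file(data: bytes, data_nodes: list) -> dict:
    # step 3: the client divides the file into blocks ...
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # ... and each block is written to REPLICATION Data Nodes; the mapping
    # below is the metadata the Name Node keeps about the file
    ring = itertools.cycle(data_nodes)
    return {block_id: [next(ring) for _ in range(REPLICATION)]
            for block_id in range(len(blocks))}

# e.g. a 150 MB file becomes 3 blocks, each replicated on 3 of the 4 Data Nodes
placement = write_file(b"x" * (150 * 1024 * 1024), ["dn1", "dn2", "dn3", "dn4"])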

3.2.2 MapReduce Programming Model

MapReduce is the key concept behind Hadoop. It is widely recognized as the most important programming model for cloud computing. MapReduce is a technique for dividing work across a distributed system.

In the MapReduce programming model, users have to define only two functions: a map and a reduce function.

The map function processes a (key, value) pair and returns a list of intermediate (key, value) pairs:

map (k1, v1) → list(k2, v2).

The reduce function merges the intermediate values having the same intermediate key:

reduce (k2, list(v2)) → list(v3).
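Written as Python type aliases, the two signatures look as follows; this is only a sketch added for clarity, with type-variable names mirroring the notation above.

from typing import Callable, TypeVar

K1, V1, K2, V2, V3 = (TypeVar(n) for n in ("K1", "V1", "K2", "V2", "V3"))

# map (k1, v1) -> list((k2, v2))
MapFn = Callable[[K1, V1], list[tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(v3)
ReduceFn = Callable[[K2, list[V2]], list[V3]]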

Execution phase of a generic MapReduce application - The following sequence of actions occurs when a user submits a MapReduce job:

1. The MapReduce library in the user program first splits the input files into M pieces. The size of these pieces ranges from 16 MB to 64 MB. It then starts up many copies of the program on the machines of the cluster.

2. Among these programs, one is the master and the others are workers (slaves). There are in total M map tasks and R reduce tasks. The master picks idle workers and assigns each one a map or a reduce task.


Figure 2: Execution phase of a generic MapReduce application

3. A map task reads the contents of the corresponding input split. It processes the key-value pairs of the input data and passes each pair to the user-defined map function. The intermediate key-value pairs produced are buffered in memory.

4. The buffered pairs are written to local disk, and the locations of these pairs are passed back to the master. The master then forwards these locations to the reduce workers.

5. When a reduce worker gets these locations, it uses remote procedure calls to read the data from the map workers. After reading all the intermediate pairs, the reduce worker sorts them by the intermediate keys so that all occurrences of the same key are grouped together.

6. For each intermediate key, the user-defined reduce function is applied to the corresponding intermediate values. Finally, the output of the reduce function is appended to the final output file.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.

After successful execution of these steps, the output is stored in R output files (one per reduce task).
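The control flow of steps 1-7 can be condensed into a single-machine simulation. The sketch below is a deliberately simplified assumption-laden toy (everything runs sequentially, and "disk locations" become in-memory lists), but it preserves the M map tasks, the partitioning of intermediate keys across R reduce tasks, the sort by key, and the R output files:

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R):
    # map phase: one map task per input split (M = len(splits)); each
    # intermediate pair lands in the partition of its reduce task
    partitions = [defaultdict(list) for _ in range(R)]
    for split_id, split in enumerate(splits):
        for k2, v2 in map_fn(split_id, split):
            partitions[hash(k2) % R][k2].append(v2)

    # reduce phase: each reduce task sorts its keys so that equal keys are
    # grouped, applies reduce_fn, and produces one output "file"
    outputs = []
    for part in partitions:
        outputs.append({k2: reduce_fn(k2, part[k2]) for k2 in sorted(part)})
    return outputs  # R outputs, one per reduce task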


3.3 An Example: Word Count

A simple MapReduce program can be written to determine how many times different wordsappear in a set of files.

Let the content of a file be:

the quick

brown fox

the fox ate

the mouse

how now

brown cow

The whole MapReduce process is depicted in the figure below:

Figure 3: Word Count Execution

1. The MapReduce library splits the file content into three parts (shown at the top of Figure 3). After splitting the data, it starts up many copies of the program on a cluster of machines.

2. The master assigns the map tasks to three map workers. The code of the map function looks like:

mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)

3. The map function is applied to each split, generating the intermediate key-value pairs shown in Figure 3.

4. When a map worker is done, it reports to the master and gives the location of its output.

5. When all the map tasks are done, the master starts reduce tasks on the idle machines and gives them the locations from which the reduce workers copy the intermediate key-value pairs.

6. After receiving all the intermediate key-value pairs, each reduce worker sorts the pairs to group them by intermediate key.

7. At this point, the reduce function is applied to the intermediate key-value pairs. The pseudocode of the reduce function is:

reducer (word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)


8. The final output of the reduce function is:

brown, 2

fox, 2

how, 1

now, 1

the, 3

ate, 1

cow, 1

mouse, 1

quick, 1
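The whole example can be verified with a short, self-contained Python version of the word count. The three splits below are assumed to be the ones shown in Figure 3, and the mapper/reducer bodies follow the pseudocode above:

SPLITS = ["the quick brown fox",
          "the fox ate the mouse",
          "how now brown cow"]

def mapper(split):
    for word in split.split():
        yield word, 1          # emit (word, 1)

def reducer(word, values):
    return sum(values)         # emit (word, sum)

# shuffle: group the intermediate pairs by key
groups = {}
for split in SPLITS:
    for word, one in mapper(split):
        groups.setdefault(word, []).append(one)

counts = {word: reducer(word, vals) for word, vals in sorted(groups.items())}
print(counts)  # {'ate': 1, 'brown': 2, 'cow': 1, 'fox': 2, 'how': 1,
               #  'mouse': 1, 'now': 1, 'quick': 1, 'the': 3}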

4 Description of Second Paper

Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments"

4.1 Issues

MapReduce is a programming model that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of machines. In a cloud, nodes may leave and join at runtime, so a system is required that can handle such conditions. The MapReduce implementation discussed so far is based on a centralized architecture and cannot cope with a dynamic infrastructure in which nodes join and leave the network at high rates. This paper describes an adaptive P2P-MapReduce system that can handle the situation in which a master node fails.

4.2 Approach used to Tackle the Issue

The main goal of P2P-MapReduce is to provide an infrastructure in which nodes may join and leave the cluster without affecting the MapReduce functionality. This is required because cloud environments exhibit high levels of churn. To achieve this goal, P2P-MapReduce adopts a peer-to-peer model in which a wide set of autonomous nodes can act either as masters or as slaves. Nodes switch between the master and slave roles dynamically, in such a way that the ratio between the number of masters and the number of slaves remains constant.

In P2P-MapReduce, to prevent the loss of computation in case of master failure, there are some backup masters for each master. The master responsible for a job J, referred to as the primary master for J, dynamically updates the job state on its backup nodes, which are referred to as the backup masters for J. If at some instant a primary master fails, its place is taken by one of its backup masters.

4.2.1 System Model

The system model of P2P-MapReduce describes the characteristics of jobs, tasks, users, and nodes at an abstract level. The UML class diagram is given below:

Figure 4: System model described through the UML Class Diagram.

• Job: A job is modelled as the following tuple:

job = 〈jobId, code, input, output,M,R〉

where jobId is a job identifier, code includes the map and reduce functions, input and output represent the locations of the input and output data respectively, and M and R are the numbers of map tasks and reduce tasks respectively.

• Task: A task is modelled as the following tuple:


task = 〈taskId, jobId, type, code, input, output〉

where taskId and jobId are the task and job identifiers respectively, type can be either MAP or REDUCE, code represents the map or reduce function (depending on the task type), and input and output represent the locations of the input and output data of the task.

• User: A user is modelled as a pair of the form:

user = 〈userId, userJobList〉

where userId is the user identifier and userJobList is the list of jobs submitted by the user.

• Node: A node is modelled as the following tuple:

node = 〈nodeId, role, primaryJobList, backupJobList, slaveTaskList〉

where nodeId represents the node identifier, role identifies the node's role (MASTER or SLAVE), primaryJobList is the list of jobs the node manages as the primary master, backupJobList is the list of jobs for which it is acting as a backup master, and slaveTaskList contains the list of (map or reduce) tasks assigned to the node, and is empty if the node's role is MASTER.

• PrimaryJobType: The primaryJobList contains tuples of primaryJobType:

primaryJobType = 〈job, userId, jobStatus, jobTaskList, backupMasterList〉

where job is a job descriptor, userId is the user identifier, jobStatus is the current status of the job, jobTaskList is the list of tasks contained in the job, and backupMasterList is the list of backup masters of the job.

• JobTaskType: The jobTaskList contains tuples of jobTaskType:

jobTaskType = 〈task, slaveId, taskStatus〉

where task is a task descriptor, slaveId is the identifier of the slave node responsible for the task, and taskStatus is the current status of the task.

• BackupJobType: The backupJobList contains tuples of backupJobType, defined as:

backupJobType = 〈job, userId, jobStatus, jobTaskList, backupMasterList, primaryId〉

backupJobType differs from primaryJobType by the presence of an additional field, primaryId, which represents the identifier of the primary master associated with the job.

• SlaveTaskType: The slaveTaskList contains tuples of slaveTaskType:

slaveTaskType = 〈task, primaryId, taskStatus〉

where task is a task descriptor, primaryId is the identifier of the primary master associated with the task, and taskStatus contains its status.
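These tuples translate almost directly into Python dataclasses. The sketch below models only Job and Node (the remaining types follow the same pattern); the field names mirror the tuples above, and everything else is an illustrative assumption.

from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    MASTER = "MASTER"
    SLAVE = "SLAVE"

@dataclass
class Job:
    job_id: str    # jobId
    code: object   # the map and reduce functions
    input: str     # location of the input data
    output: str    # location of the output data
    M: int         # number of map tasks
    R: int         # number of reduce tasks

@dataclass
class Node:
    node_id: str
    role: Role
    primary_job_list: list = field(default_factory=list)  # jobs managed as primary master
    backup_job_list: list = field(default_factory=list)   # jobs backed up on this node
    slave_task_list: list = field(default_factory=list)   # empty while role is MASTER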

4.2.2 Architecture

There are three types of nodes in the P2P-MapReduce architecture: user, master and slave. Master nodes and slave nodes form two logical peer-to-peer networks, referred to as M-net and S-net respectively. The compositions of M-net and S-net change dynamically because, as described earlier, nodes switch between the master and slave roles.

A user node submits a MapReduce job to one of the available master nodes, selected according to the current workload of the available master nodes.

Master nodes are at the core of the system. They perform three types of operations: management, recovery and coordination. A master node that is acting as the primary master for one or more jobs executes the management operation. A master node that is acting as a backup master for one or more jobs executes the recovery operation. The coordination operation changes slaves into masters and vice versa, so as to keep the desired master/slave ratio.

A slave executes the tasks that are assigned to it by one or more primary masters.

Jobs and tasks are managed by processes called Job Managers and Task Managers, respectively. For each managed job, the primary master runs one Job Manager, while a slave runs one Task Manager for each assigned task. In addition, masters run a Backup Job Manager for each job for which they act as a backup master.

4.2.3 System Mechanism

The behaviour of a generic node can be understood through a UML state diagram, which shows the states together with the events that change the state of the node. The UML state diagram of a node in the P2P-MapReduce architecture is given below:


Figure 5: Behaviour of a generic node described by a UML State Diagram.

The state diagram shows two macro-states, SLAVE and MASTER, which are the two roles a node can have. The SLAVE macro-state has three states, IDLE, CHECK MASTER and ACTIVE, which represent, respectively: a slave waiting for task assignment; a slave checking the existence of at least one master in the network; and a slave executing one or more tasks.

The MASTER macro-state is modelled with three parallel macro-states, which represent the different roles a master can perform concurrently: possibly acting as the primary master for one or more jobs (MANAGEMENT); possibly acting as a backup master for one or more jobs (RECOVERY); and coordinating the network for maintenance purposes (COORDINATION). The MANAGEMENT macro-state contains two states: NOT PRIMARY, which represents a master node currently not acting as the primary master for any job, and PRIMARY, which, in contrast, represents a master node currently managing at least one job as the primary master. Similarly, the RECOVERY macro-state includes two states: NOT BACKUP (the node is not managing any job as a backup master) and BACKUP (at least one job is currently being backed up on this node). Finally, the COORDINATION macro-state includes four states: NOT COORDINATOR (the node is not acting as the coordinator), COORDINATOR (the node is acting as the coordinator), and WAITING COORDINATOR and ELECTING COORDINATOR for nodes currently participating in the election of the new coordinator.

The combination of the concurrent states [NOT PRIMARY, NOT BACKUP, NOT COORDINATOR] represents the abstract state MASTER.IDLE. The transition from the master role to the slave role is allowed only for masters in the MASTER.IDLE state. Similarly, the transition from the slave role to the master role is allowed only for slaves that are not in the ACTIVE state.
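The two transition rules at the end of this description can be captured in a few lines of Python. This is a toy model, not code from the paper: each parallel macro-state is represented by the data that makes it non-idle, and all names are assumptions.

from enum import Enum

class SlaveState(Enum):
    IDLE = 1
    CHECK_MASTER = 2
    ACTIVE = 3

class P2PNode:
    def __init__(self):
        self.role = "SLAVE"
        self.slave_state = SlaveState.IDLE
        self.primary_jobs = []       # MANAGEMENT: PRIMARY iff non-empty
        self.backup_jobs = []        # RECOVERY: BACKUP iff non-empty
        self.is_coordinator = False  # COORDINATION: COORDINATOR iff True

    def in_master_idle(self):
        # [NOT PRIMARY, NOT BACKUP, NOT COORDINATOR] == MASTER.IDLE
        return not (self.primary_jobs or self.backup_jobs or self.is_coordinator)

    def become_slave(self):
        # master -> slave is allowed only from the MASTER.IDLE state
        if self.role == "MASTER" and self.in_master_idle():
            self.role = "SLAVE"

    def become_master(self):
        # slave -> master is allowed only when the slave is not ACTIVE
        if self.role == "SLAVE" and self.slave_state != SlaveState.ACTIVE:
            self.role = "MASTER"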

4.3 Example

The whole system mechanism can be understood through a simple example, described by the following figure:

Figure 6: General Architecture of P2P-MapReduce.

Figure 6 shows that three jobs have been submitted in total: one job by User1 (Job1) and two jobs by User2 (Job2 and Job3). For Job1, Node1 is the primary master, and Node2 and Node3 are the backup masters. Job1 is composed of five tasks: two of them are assigned to Node4, and one each to Node7, Node9 and Node11.

The following recovery procedure takes place when the primary master Node1 fails:

• The backup masters Node2 and Node3 detect the failure of Node1 and start a distributed procedure to elect the new primary master from among themselves.

• Assuming that Node3 is elected as the new primary master, Node2 continues to play the backup role and, to keep the desired number of backup masters active (two, in this example), another backup node is chosen by Node3. Node3 then binds to the connections that were previously associated with Node1 and proceeds to manage the job using its local replica of the job state.

As soon as the job is completed, the (new) primary master notifies the result to the user node that submitted the job.
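A compact sketch of this recovery step is given below. The election criterion (smallest surviving identifier) and the node names in the usage line are arbitrary assumptions; the paper only requires some distributed election among the backups.

def recover(backups, idle_masters):
    # the backup masters detect the primary's failure and elect a new primary;
    # electing min(backups) is just one deterministic choice
    new_primary = min(backups)
    remaining = [b for b in backups if b != new_primary]
    # the new primary recruits a replacement so the backup count stays constant
    if idle_masters:
        remaining.append(idle_masters.pop(0))
    return new_primary, remaining

# hypothetical run: the primary fails, its backups Node2 and Node3 hold an election
new_primary, backups = recover(["Node2", "Node3"], ["Node5"])
print(new_primary, backups)  # Node2 ['Node3', 'Node5']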

5 Integration of both Papers

Issues
  First paper: To perform data-intensive computation in a cloud environment in a reasonable amount of time.
  Second paper: To design a peer-to-peer MapReduce system that can handle all node failures, including failures of the master node.

Approaches used
  First paper: The simple MapReduce implementation presented by Google, based on the master-slave model. This implementation is known as Hadoop.
  Second paper: A peer-to-peer architecture that handles the dynamic churn in a cluster.

Advantages
  First paper: Hadoop is scalable, reliable and distributed, and able to handle enormous amounts of data. It can process big data in almost real time.
  Second paper: P2P-MapReduce can manage node churn, master failures and job recovery in an effective way.

Table 1: Comparison between the two papers.

6 Conclusion

MapReduce is scalable and reliable, and it exploits distributed systems to perform efficiently in a cloud environment. P2P-MapReduce is a novel approach to handling the real-world problems faced by data-intensive computing. P2P-MapReduce is more reliable than the plain MapReduce framework because it is able to manage node churn, master failures, and job recovery in a decentralized but effective way. Thus, cloud-based programming models are likely to be a future trend in the programming field.


References

[1] Gaizhen Yang, "The Application of MapReduce in the Cloud Computing", International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), October 2011, pp. 154-156, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6103560.

[2] Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments", Journal of Computer and System Sciences, vol. 78, issue 5, September 2012, pp. 1382-1402, http://dl.acm.org/citation.cfm?id=2240494.

[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters", OSDI'04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, vol. 6, 2004, www.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf and http://dl.acm.org/citation.cfm?id=1251254.1251264.

[4] The Racket Guide, http://docs.racket-lang.org/guide/.

[5] Hadoop Tutorial - YDN, http://developer.yahoo.com/hadoop/tutorial/module4.html.

[6] http://readwrite.com/2012/10/15/why-the-future-of-software-and-apps-is-serverless.

[7] F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments", in: N. Antonopoulos, L. Gillam (eds.), Cloud Computing: Principles, Systems and Applications, Springer, Chapter 7, pp. 113-125, 2010.

[8] IBM developerWorks, "Using MapReduce and load balancing on the cloud", http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/.
