


MapReduce: MR Model Abstraction for Future Security Study

Ibrahim Lahmer, Ning Zhang
School of Computer Science
The University of Manchester
Manchester, M13 9PL, UK
[email protected]
[email protected]

ABSTRACT
MapReduce is a new parallel programming paradigm proposed to process large amounts of data in a distributed setting. Since its introduction, there have been efforts to improve the architecture of this model, making it more efficient, secure and scalable. In parallel with these developments, there have also been efforts to implement and deploy MapReduce; one of the most popular open-source implementations is Hadoop. Some more recent practical MapReduce implementations have made architectural changes to the original MapReduce model, e.g., those by Facebook and IBM. These architectural changes may have implications for the design of solutions to secure MapReduce. To capture these changes and to inform any future design of such solutions, this paper attempts to build a generic MapReduce computation model capturing the main features and properties of the most recent MapReduce implementations.

Keywords
MR: MapReduce, DFS: Distributed File System, YARN: Yet Another Resource Negotiator; Master node, Resource Manager, JobTracker, TaskTracker, NameNode, DataNode.

1. INTRODUCTION
MapReduce, as a parallel and distributed programming framework, is becoming more and more widely used [17]. Hadoop, an implementation of the MapReduce framework, has been adopted by many companies, including major IT players such as Facebook, eBay, IBM and Yahoo [2] [16]. Recently, a team of Facebook engineers [5] announced that they had developed a new MapReduce topology called Corona, and Yahoo developers [10] and IBM developers [11] announced that they had developed and used a new MapReduce implementation, YARN. These new implementations of MapReduce use architectures that are different from the one used by Hadoop, the previous implementation. For example, in Corona and YARN, the functions played by the single master node in Hadoop are split into two sets, played respectively by two master nodes, a Resource Manager and a JobTracker. In other words, Corona and YARN use two masters to perform the functions that are performed by the single master in Hadoop. While these architectural changes in the new MapReduce implementations bring a number of benefits, such as improved performance and a reduced chance of bottlenecks in the system [5][11], they also have implications for other aspects of the system, such as security provisioning. For example, these changes alter the way different system components interact, and how and from which entities various requests and responses are generated. This changes the risk landscape of the system, and therefore how security should be provided. However, existing security solution designs for MapReduce have not taken these architectural changes into account. For example, Cohen and Acharya (2013) designed an encryption service and examined it against the old version of the MapReduce framework [1]; the scheme encrypts the data that is uploaded by the client and stored in the DFS. Furthermore, Somu et al. (2014) proposed another security service [12]: an authentication service by which the client is authenticated to the MapReduce framework. Both schemes are shown to perform well, in terms of security and performance, on the old MapReduce version. Their designs assume that there is only one master node (the JobTracker), to which the client authenticates and submits his job, and which also retrieves the input data splits uploaded to the DFS and allocates the TaskTrackers (worker nodes). However, in the current MapReduce framework implementations this is no longer the case: there are two master nodes. One master node, the Resource Manager, is responsible for job submission, providing job IDs and the paths of job files to clients; the client then uploads his input data files into the DFS. The other master node, the JobTracker, retrieves the input splits uploaded by the client to the DFS and allocates the TaskTrackers assigned by the Resource Manager. This raises security concerns regarding the way the MR components interact and access the data stored in the DFS. As an example of these concerns, in the new MR framework implementations the client authenticates to the Resource Manager rather than to the JobTracker, while his input splits are retrieved by the other master node. This means that the JobTracker also needs the proper secret keys to access the DFS and retrieve the input splits once the


client's job submission is completed. Also, in the new MR implementations, the worker daemons have to authenticate to the Resource Manager, rather than to the JobTracker, to obtain admission to the MR framework.
The authors' main research task is to investigate and design security solutions for MapReduce applications, as the designs of some security services for MapReduce are necessarily architecture-dependent, and as the existing MapReduce security solutions are largely homogeneous and only secure the entrance to a job execution [13] [9]. Such solutions should therefore be designed against a generic MapReduce framework. However, first, there is no generic MapReduce model in the literature that captures the recent changes in the MapReduce architecture. Second, through literature research into MapReduce architectures [2][4] and existing MapReduce applications, such as those used by [5][10][11], we have identified a mismatch between the original MapReduce architecture and what has been deployed in recent real-life applications. As a first step, this paper tries to bridge this gap by presenting an abstract MapReduce model that captures the features and properties of the most recent deployments of MapReduce: we investigate the existing MapReduce architecture model and synthesise from these real-life deployments. In future steps, we will build on this model to present generic security requirements and a security solution design for it.
The remainder of this paper is structured as follows. Section 2 introduces MapReduce, its architecture and architectural components, and explains, at a generic level, how it is typically deployed in a Cloud environment. Section 3 analyses, in detail, a simple example of a job execution process using MapReduce; starting from this example, it constructs an abstract model of MapReduce execution that highlights the procedures and interactions among the architectural components during a job execution. Finally, Section 4 concludes the paper and outlines future work.

2. MAPREDUCE ARCHITECTURE
This section introduces the MapReduce framework, highlighting its physical and logical components, and explains how the framework may be deployed in practice.

2.1 An Overview
MapReduce is a new programming model that supports parallel processing and distributed resource sharing. It uses a set of distributed nodes (called servers) that work collaboratively to process a large amount of data, i.e. to execute a specified job. A job execution in MapReduce is carried out in two distinctive phases: Map and Reduce. In the Map phase, a set of nodes, called Mappers, is used to process and convert one set of data (the input) into another set of data (the intermediate results). The input data is broken into key-value pairs called tuples. In the Reduce phase, a number of nodes, called Reducers, combines and processes the intermediate results to produce a smaller set of tuples, the job result. As the framework's name indicates, in MapReduce the Map tasks are always executed before the Reduce tasks [17]. Figure 1 shows the generic sequence of processing in MapReduce applications.
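In the notation of [2], the two phases can be summarised by the following type signatures, where (k1, v1) are the input key-value types and (k2, v2) the intermediate ones:

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)

That is, a Mapper emits a list of intermediate tuples per input pair, and a Reducer collapses all the values grouped under one intermediate key into a (typically smaller) list of values.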

2.2 MapReduce Components

Figure 1: A typical MapReduce task sequence [3]

A MapReduce system consists of a number of components, which are summarized in Figure 2. As shown in the figure, these components can largely be classified into two main groups (called clusters). One is the Distributed File System (DFS, or Distributed Storage), which is used to store data files during job executions; the DFS consists of one NameNode and multiple DataNodes. The other is the Processing Framework (PF) cluster, which is used to carry out job executions (or distributed computations); the PF consists of two master nodes and a number of worker nodes. In addition to these physical components, there are also logical components (i.e. software components) and data components. In the following, we discuss the functionalities of these components [4] [5] [15] [16].

2.2.1 Physical Components
Physical components are the actual physical nodes used to host the software components, clients and servers. Depending on the roles played by the servers, they may be master workers or slave workers. The nodes hosting these components are therefore referred to as user machines, master worker nodes (or master nodes, in short) and slave worker nodes (or slave nodes, in short).

1. User machines: a user machine runs a client application, which largely performs two major tasks: (i) submitting a job to the MapReduce master node, and (ii) specifying the configuration and triggering and monitoring the job execution [8].

2. Master nodes: as shown in Figure 2, master nodes are of the following types.

(a) Central Cluster Manager (also called Resource Manager): this node is responsible for managing job schedules and cluster resources, tracking the nodes in the cluster and the amount of free resources on them. There is one such node per cluster.

(b) JobTracker (also called Master Node or MR Application Master): it runs a program, called the master daemon, that is responsible for managing (monitoring and tracking the progress of) each individual job execution. It also negotiates with the Resource Manager over the use of resources (containers, i.e. slaves). Usually, there are one or more JobTrackers per cluster. In the older MapReduce framework version, a single Master Node plays the roles of both the Resource Manager and the JobTracker.

(c) NameNode (also called Catalogue Server or Metadata Node): this node manages and maintains the set of DataNodes, i.e. the file system and the metadata for all the directories and files stored on the DataNodes. It is usually part of the DFS cluster and there is one per cluster.


3. Slave nodes: they are of two types, TaskTrackers andDataNodes.

(a) TaskTracker: a TaskTracker node may host a single Mapper, a single Reducer or both. Usually, there is more than one TaskTracker node per cluster. As shown in Figure 2, each TaskTracker runs a worker daemon to execute multiple tasks.

(b) DataNode: this node is also part of the DFS and is sometimes called a Storage Server. It is where the actual data files are stored, in units of data blocks (each 64MB or 128MB), and shared. Each data block has its own block ID and, for fault-tolerance and performance considerations, is replicated on a number of DataNodes. A DataNode periodically sends a report to the NameNode listing the blocks it stores, as shown in Figure 2. There are typically multiple DataNodes in a cluster. In some cases, for performance considerations, the DataNode may be the same node as the TaskTracker node [15] [17]. A client-side sketch of writing into the DFS is given after Figure 2.

Figure 2: MapReduce framework components. (The figure shows the client machine submitting a job to the Data Processing Framework cluster, whose master work-pattern nodes are the Central Cluster Manager (Resource Manager) and the JobTracker running the master daemon, and whose slave work-pattern nodes are TaskTrackers 1..m, each running a worker daemon with Map and Reduce tasks; the DFS cluster comprises the NameNode (Catalogue Server), holding the block-to-DataNode location table, and DataNodes 1..n (Storage Servers), which store the replicated blocks, serve read/write requests and periodically report to the NameNode.)
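As a client-side illustration of the NameNode/DataNode division of labour, the following is a minimal sketch that writes a file into the DFS using the Hadoop FileSystem Java API. The NameNode address, replication factor and file path are assumptions for illustration only: the client asks the NameNode where to write, while the bytes themselves are streamed to the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // NameNode address (assumed)
            conf.set("dfs.replication", "3");                 // replicate each block on 3 DataNodes

            // The NameNode records the file's metadata and block locations;
            // the data blocks themselves are stored on the DataNodes.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/jobs/input/temps.txt"))) {
                out.writeBytes("1,York,0\n2,Manch.,19\n");
            }
            fs.close();
        }
    }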

2.2.2 Logical Components and Data
Logical components, or programs, are executed on the above-mentioned physical components to accomplish a computational task or to process data. These logical components, and the types of data involved, are explained below.

1. Master daemon: executed on the JobTracker node; it performs the tasks assigned to the JobTracker.

2. Worker daemon: executed on a TaskTracker node. Its function is to launch and manage the mapper and reducer child daemons (i.e. the mission of a worker daemon is to execute the Map and Reduce tasks).

3. Job history daemon: also executed on the JobTracker node. Its function is to store and archive the job history for later interrogation by the user, if desired.

4. Input data broadly refers to the various input data and/or files used in a MapReduce computation. It includes the job configuration file, the input data files and the computed input-split information. The data in an input data file is parsed into key-value pairs by the Mappers, which then produce the intermediate buffered data.

5. Intermediate buffered data is the output of the Map phase. Assuming that the number of Reducers used in a job execution is R, the intermediate data is divided into a set of R regions called partitions, i.e. for each Reducer assigned to a job execution there is one data partition of Map-phase output to process (a sketch of the partition function follows this list).

6. Shuffle handler: a service designed to sort and group the intermediate key-value pairs according to the key (an example in the next section explains what this key is and how key-value pairs are used to represent the data). It runs on the TaskTracker node.

7. Output data is the final result produced by the Reducers.

8. Job configuration is a set of variables and parameters specific to a certain job and specified by the user.
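How the intermediate key-value pairs are divided into the R partitions is decided by a partition function. As a minimal sketch, the default behaviour in Hadoop-style implementations hashes the key modulo the number of Reducers, as in the class below (which mirrors Hadoop's default HashPartitioner); a job-specific partitioner would replace getPartition.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate (key, value) pair to one of numPartitions
    // Reducers; masking with Integer.MAX_VALUE keeps the hash non-negative.
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }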

Figure 2 depicts both the physical and the logical components of the MapReduce framework, organized into the two cluster categories.

2.3 Implementing and Hosting MapReduce Components in the Cloud

MapReduce components are typically hosted in a virtualized environment by a MapReduce provider. Virtualization allows multiple independent operating system instances to run on a single physical node, and thereby allows physical resources to be shared by multiple users [8]. The MapReduce framework is hosted and implemented in such an environment because of the advantages this technology provides: for instance, server consolidation, increased up-time (e.g. fault tolerance, live migration, high availability, storage migration) and the isolation of application tasks within one physical server [7].
The MapReduce application components run in the cloud system are hosted on top of Virtual Machines (VMs). Figure 3 shows how the different components of a MapReduce application, such as the master node (master daemon), worker nodes (worker daemons), DataNodes and NameNode, are hosted. They can run on top of the same physical server, isolated using virtualization host machines and virtual switches, while sharing physical resources (e.g. CPUs and RAM). At the network level, they are isolated by means of virtual switches and VLAN mechanisms. Communications among sets of MapReduce components hosted on different physical servers are carried out through the external network fabric. The hypervisor is the core of the virtualization technique: as a middleware layer, it is responsible for creating, managing and executing the VM operating system instances and for sharing the physical resources. There are a number of Cloud


service delivery models [6], but the following two are used to host a MapReduce application. They differ in the services provided by the cloud provider, as follows:

1. Infrastructure-as-a-Service (IaaS): in this delivery model the customer allocates virtual resources as needed. The IaaS provider delivers the required storage and networking devices in the form of a wrapped service. The provider also covers basic security needs, including physical security and system security such as firewalls, Intrusion Prevention Systems (IPS) and Intrusion Detection Systems (IDS). A MapReduce application can be implemented using this service delivery model, in which case the application can be managed directly by the MR client through the master nodes.

2. Platform-as-a-Service (PaaS): this service delivery model offers facilities such as Software Development Kits (SDKs) (e.g. for Python and Java) that customers can use to write their own programs and develop their own applications. With this service delivery model, customers do not have to manage the required hardware or software components, such as VMs, virtual switches, CPUs and RAM. A MapReduce application can also be implemented using this Cloud service delivery model.

Figure 3: Hosting MapReduce components in the Cloud. (The figure shows a client machine reaching the Cloud over the Internet; on one physical server, VMs host the master node (JT), the NameNode (NN) and a DataNode (DN), while on another, VMs host worker nodes (TT) and a DataNode (DN). Each server stacks its VMs above virtual switches and VLANs, a hypervisor (VMware, Hyper-V, ...) and shared CPUs, RAM and other physical resources, and the servers are interconnected by a network fabric of physical switches, routers, firewalls, external storage and power units.)

3. A GENERIC MAPREDUCE COMPUTATION MODEL

The components of the MapReduce model execute and process the input data according to a specific execution data work-flow, which involves a number of interactions between the model's components. Starting from a practical use-case scenario, this section builds an abstract model of MapReduce computation capturing the data flows as well as the interactions among MapReduce components during a job execution. This abstract model can serve as a basis for our future design of a security solution for MapReduce.

3.1 MapReduce Job Execution: an Example
To briefly describe the data execution work-flow, we illustrate it with a simplified practical example. Suppose there is a large number of records containing the temperatures of UK cities, registered daily over one year. They are stored in a number of database files, as shown in the tables of Figure 4, and these files are stored in a number of data blocks in the DFS. In this example, a job is submitted by the client to the Master; the duty of the job is to find the maximum temperature of each city (each city has more than one record per file).

File 1:
  Record No. | City Name | Temp. Cent. | Other Data
  1          | York      | 0           | ...
  2          | Manch.    | 19          | ...
  3          | Leeds     | 9           | ...
  4          | Leeds     | 20          | ...
  5          | London    | 13          | ...
  6          | Leeds     | 18          | ...
  7          | London    | 6           | ...
  8          | York      | 13          | ...
  ...        | ...       | ...         | ...

File 2:
  Record No. | City Name | Temp. Cent. | Other Data
  1          | Manch.    | 15          | ...
  2          | Manch.    | 10          | ...
  3          | York      | 9           | ...
  4          | London    | 20          | ...
  5          | Durham    | 20          | ...
  6          | London    | 18          | ...
  7          | Durham    | 6           | ...
  8          | York      | 13          | ...
  ...        | ...       | ...         | ...

File n:
  Record No. | City Name | Temp. Cent. | Other Data
  1          | London    | 15          | ...
  2          | Manch.    | 10          | ...
  3          | York      | 9           | ...
  4          | London    | 20          | ...
  5          | Leeds     | 13          | ...
  6          | London    | 18          | ...
  7          | Durham    | 6           | ...
  8          | York      | 13          | ...
  ...        | ...       | ...         | ...

Figure 4: A number of files storing the temperatures of UK cities

To simplify this example for explanation purposes, let us consider a number of files n = 10, each file stored in one data block, with each input split the same size as a data block. Figure 5 shows, as a layout, the MapReduce components involved in processing the input data for this task. First, the client (user) determines the job configuration parameters, the input data format and the location of the input data. In our example, the input files are divided into input splits (10 input splits), with one input split per Mapper to process. They are assigned to the Mappers by the Master based on the job configuration files and the parameters involved [14]. Each Mapper reads and parses its input split into key-value pairs (in our example, the key is the city name and the value is its temperature). The Mapper then starts the Map task execution (in this example, the Map phase finds the highest temperature for each city within each input block, i.e. file). The output of the Mappers is the intermediate buffered data, which is written locally into a number of data partitions equal to the number of Reducers (in our example, two data partitions and two Reducers). After the Mappers have completed their tasks, each partition is assigned to a Reducer. The Reducers then start their execution task, which is finding the maximum temperature per city; before that, however, the Reducers themselves have to perform the shuffle process, in which the intermediate data (within the same partition) is sorted and grouped by key. (For example, the key York has the temperature values 18, 11, 23, 9 as the output of different Mappers, as listed in the intermediate-result shuffle table in Figure 5.) The final output of each Reduce task, which is a list of cities and the maximum temperature associated with each city, is written into separate files on the DFS, as shown in Figure 5. A sketch of the corresponding Mapper and Reducer code is given below.
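To make the job concrete in code, the following is a minimal sketch written against the Hadoop MapReduce Java API (a close relative of the classic maximum-temperature example in [15]). The class names, and the assumption that each input line has the comma-separated form recordNo,cityName,temperature,otherData matching the tables above, are ours for illustration only.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: parse each record into a (city, temperature) pair.
    class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: recordNo,cityName,temperature,otherData
            String[] fields = line.toString().split(",");
            String city = fields[1];
            int temp = Integer.parseInt(fields[2]);
            context.write(new Text(city), new IntWritable(temp));
        }
    }

    // Reduce phase: after the shuffle has grouped all temperatures of one
    // city (e.g. York -> [18, 11, 23, 9]), emit the maximum.
    class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(city, new IntWritable(max));
        }
    }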

3.2 Interactions of MapReduce Components
The following points describe a typical execution flow of a job being executed by the MapReduce framework, from the moment the job is submitted by the client to the moment it is successfully completed. This data execution work-flow is further illustrated in Figure 6.


Figure 5: Simplified practical example of the data execution flow in the MapReduce model. (The figure traces the client's job submission and the ten input files/blocks from the DFS through the reading/parsing phase, in which each input split is parsed into (city, temperature) key-value pairs, the Map tasks, the shuffling of the Map-task output into two partitions P1 and P2 (e.g. York: 18, 11, 23, 9, ...), the two Reduce tasks, and the two output files of per-city maximum temperatures written back to the DFS.)


1. First, the user (on the client node) runs the MapReduce program (job client) after being authenticated to the master node using his credentials, usually a username and password (Figure 6 (1)(2)). The master node is either the Resource Manager, in the new version of the MapReduce framework, or the JobTracker itself, in the previous version of the framework. This initial authentication is done using Kerberos. The Kerberos protocol involves a Key Distribution Centre (KDC) server with which both the client and the server are registered. The KDC contains an Authentication Service (AS) and a Ticket Granting Service (TGS). The client uses his credentials to make an authentication request to the AS in order to obtain a Ticket Granting Ticket (TGT); the AS then issues a TGT to the client. The client then sends the TGT to the TGS to request a service ticket. Once the client has the service ticket, he uses it to access the designated resource (i.e. service).

2. The client requests a job ID (i.e. a new application ID) and the path to which the job definition files are to be written (Figure 6 (3)).

3. Once the client acquires the job ID (Figure 6 (4)), the client contacts the NameNode to start writing into the DFS (Figure 6 (5)). The client authenticates to the NameNode using the service ticket (which contains the address of the remote service/resource as well as the secret) issued by the TGS during the initial exchange between the client and the AS.

4. The client is redirected to the DataNodes and starts copying the job resources (the configuration file and the input splits computed by the job client) to the specified directories (Figure 6 (6)).

5. Once the copying is complete, the client informs the master node that the job is ready to be launched; this is called client job submission (Figure 6 (7)).

6. The Resource Manager allocates the JobTracker, and the master daemon (Application Master) is launched on this JobTracker (Figure 6 (8)(9)). This daemon creates an object instance to represent the job, and keeps track of the progress of the job's tasks.

7. In order to create the list of tasks to be run, the JobTracker retrieves the input splits (computed by the client) from the DFS (Figure 6 (10)); one Map task object is assigned to each input split. The JobTracker also determines the Reduce task objects.

8. The JobTracker requests TaskTrackers from the Resource Manager for all the Map and Reduce tasks (Figure 6 (11)). In this request, as a performance consideration, the JobTracker sends the data location of each Map task, trying to obtain a Mapper node close to its input split.

9. Once the tasks (Map or Reduce) have been assigned to TaskTrackers, the mapper and reducer child daemons are launched by the worker daemon (Figure 6 (13)(14)). Before the Map tasks start, however, the worker daemon retrieves the job resources from the DFS (DataNodes) and copies them locally (Figure 6 (12)). The Map task then parses the input data into key-value pairs, as explained in the previous example.

Figure 6: Data execution work-flow in the MapReduce model. (The figure depicts the numbered interactions (1)-(18) described in this section, among the client machine, the Resource Manager (Central Cluster Manager), the JobTracker running the master daemon, the TaskTrackers running worker and child daemons, and the DFS NameNode and DataNodes.)

10. After the Map tasks have been completed by the Mappers, the Reducers start to read the output data of the Map tasks (Figure 6 (15)). Each Reducer is assigned one data partition, as explained in the previous example, and shuffles the Map-phase output data (i.e. the intermediate data) in that partition. The partitioning was specified earlier by the Master using the user's configuration file. The intermediate data of the specified partition may be located on different nodes, as it is the output of different Mapper child daemons, so the Reducer child daemon connects to the designated worker nodes (daemons), where the partial data of the indexed partition is stored locally, to gain access to the desired data (Figure 6 (15)).

11. Once the Reduce task has executed and completed, the worker child daemon writes its output into a file on the DFS (Figure 6 (16)).

12. Finally, the Reducer (worker daemon) informs the master node (JobTracker) whether or not its task has completed successfully (Figure 6 (17)). Once all the Reducers involved in the submitted job have completed the tasks assigned to them, they notify the master node (JobTracker), and the JobTracker updates the client with the job status (Figure 6 (18)). The client-side counterpart of this flow is sketched below.
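From the client's perspective, steps (1)-(7) and (18) correspond roughly to authenticating, configuring and submitting a job, and then waiting for its status. The following driver is a minimal sketch using the Hadoop Job API; the Kerberos principal, keytab and DFS paths are illustrative assumptions, and the Mapper/Reducer classes are those sketched in Section 3.1.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.security.UserGroupInformation;

    public class MaxTempDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Step (1): initial Kerberos authentication of the client
            // (principal and keytab path are assumed, not prescribed).
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                    "client@EXAMPLE.REALM", "/etc/security/client.keytab");

            // Steps (2)-(4): the job client obtains a new job/application ID
            // and the paths for the job definition files.
            Job job = Job.getInstance(conf, "max-temperature");
            job.setJarByClass(MaxTempDriver.class);
            job.setMapperClass(MaxTempMapper.class);
            job.setReducerClass(MaxTempReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(2); // two partitions/Reducers, as in the example

            // Steps (5)-(7): job resources and input-split information are
            // written to the DFS, then the job is submitted to the master.
            FileInputFormat.addInputPath(job, new Path("/jobs/input"));    // assumed path
            FileOutputFormat.setOutputPath(job, new Path("/jobs/output")); // assumed path

            // Step (18): submit the job and poll the master for progress
            // until it completes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }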

4. CONCLUSION
In this paper, we have examined real-life applications of the MapReduce model. Based on this examination, we have built an abstract MapReduce use-case model that captures, at a generic level, the main characteristics of a MapReduce execution, the functionality of each of its components, and the interactions among the components in executing a computational job. This abstract model, which reflects the recent changes to MapReduce, will be used in our future work on designing a cost-effective security solution for MapReduce: to study any security services for such a system, it is necessary to understand its anatomy, including the most recent MapReduce model.

5. REFERENCES
[1] Jason Cohen and Subatra Acharya. Towards a trusted Hadoop storage platform: design considerations of an AES based encryption scheme with TPM rooted key protections. In Ubiquitous Intelligence and Computing, 2013 IEEE 10th International Conference on and 10th International Conference on Autonomic and Trusted Computing (UIC/ATC), pages 444-451. IEEE, 2013.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[3] Google Developers. MapReduce for App Engine. Google Inc, 2014.
[4] James Dyer and Ning Zhang. Security issues relating to inadequate authentication in MapReduce applications. In High Performance Computing and Simulation (HPCS), 2013 International Conference on, pages 281-288. IEEE, 2013.
[5] Facebook Engineers. Under the hood: scheduling MapReduce jobs more efficiently with Corona. Facebook Inc, 2012.
[6] Diogo A. B. Fernandes, Liliana F. B. Soares, Joao V. Gomes, Mario M. Freire, and Pedro R. M. Inacio. Security issues in cloud environments: a survey. International Journal of Information Security, 13(2):113-170, 2014.
[7] David Marshall. Top 10 benefits of server virtualization. InfoWorld, 2011.
[8] George Ou. Introduction to server virtualization. TechRepublic, 2006.
[9] G. S. Sadasivam, K. A. Kumari, and S. Rubika. A novel authentication service for Hadoop in cloud environment. In Cloud Computing in Emerging Markets (CCEM), 2012 IEEE International Conference on, pages 1-6. IEEE, 2012.
[10] Sumeet Singh. Hadoop at Yahoo!: more than ever before. Yahoo! developer's network, 2013.
[11] Sumeet Singh. Introduction to YARN. Hadoop developer and administrator, IBM developer's team, 2013.
[12] Nivethitha Somu, A. Gangaa, and V. S. Shankar Sriram. Authentication service in Hadoop using one time pad. Indian Journal of Science and Technology, 7(4):56-62, 2014.
[13] Nivethitha Somu, A. Gangaa, and V. S. Shankar Sriram. Authentication service in Hadoop using one time pad. Indian Journal of Science and Technology, 7(4):56-62, 2014.
[14] Hadoop Tutorial Team. Hadoop architecture. Tangient LLC, 2013.
[15] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
[16] Jing Xiao and Zhiwei Xiao. High-integrity MapReduce computation in cloud with speculative execution. In Theoretical and Mathematical Foundations of Computer Science, pages 397-404. Springer, 2011.
[17] Paul Zikopoulos, Chris Eaton, et al. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, 2012.