A security framework in G-Hadoop for big data computing across distributed Cloud data centres



<ul><li><p>Journal of Computer and System Sciences 80 (2014) 994–1007</p><p>Contents lists available at ScienceDirect: Journal of Computer and System Sciences, www.elsevier.com/locate/jcss</p><p>A security framework in G-Hadoop for big data computing across distributed Cloud data centres</p><p>Jiaqi Zhao (a), Lizhe Wang (b,*), Jie Tao (c), Jinjun Chen (d,*), Weiye Sun (c), Rajiv Ranjan (e), Joanna Kołodziej (f), Achim Streit (c), Dimitrios Georgakopoulos (e)</p><p>(a) School of Basic Sciences, Changchun University of Technology, PR China; (b) Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, PR China; (c) Steinbuch Center for Computing, Karlsruhe Institute of Technology, Germany; (d) Faculty of Engineering and IT, University of Technology Sydney (UTS), Australia; (e) Information Engineering Laboratory, CSIRO ICT Centre, Australia; (f) Institute of Computer Science, Cracow University of Technology, Poland</p><p>Article history: Received 30 September 2012; Received in revised form 15 March 2013; Accepted 27 August 2013; Available online 12 February 2014</p><p>Keywords: Cloud computing; Massive data processing; Data-intensive computing; Security model; MapReduce</p><p>Abstract: MapReduce is regarded as an adequate programming model for large-scale data-intensive applications. The Hadoop framework is a well-known MapReduce implementation that runs the MapReduce tasks on a cluster system. G-Hadoop is an extension of the Hadoop MapReduce framework with the functionality of allowing the MapReduce tasks to run on multiple clusters. However, G-Hadoop simply reuses the user authentication and job submission mechanism of Hadoop, which is designed for a single cluster. This work proposes a new security model for G-Hadoop. The security model is based on several security solutions, such as public key cryptography and the SSL protocol, and is designed specifically for distributed environments. 
This security framework simplifies the user authentication and job submission process of the current G-Hadoop implementation with a single-sign-on approach. In addition, the designed security framework provides a number of different security mechanisms to protect the G-Hadoop system from traditional attacks.</p><p>© 2014 Elsevier Inc. All rights reserved.</p><p>1. Introduction</p><p>Today, data explosion is commonly observed in various scientific and social domains, such as GeoScience, Life Science, High Energy and Nuclear Physics, as well as Materials and Chemistry. Modern scientific instruments, the Web, and simulation facilities are producing huge data in the range of several petabytes. Currently, MapReduce [3] is commonly used for processing such big data. With a Map and a Reduce function, MapReduce provides simple semantics for users to program data analysis tasks in the code. Additionally, the parallelism in MapReduce is handled automatically by a runtime framework, which is especially friendly for application developers.</p><p>Apache Hadoop [1] is one of the well-known MapReduce implementations [3,18,7,8,4,11,16] and has been used/extended by scientists as the base of their own research work [15,19,10,6]. Hadoop's MapReduce architecture is based on a master/slave communication model with one JobTracker as the master and multiple TaskTrackers acting in the role of slaves. Hadoop uses its own file system, the Hadoop Distributed File System (HDFS), to manage the input/output data of the MapReduce applications.</p><p>* Corresponding authors. E-mail addresses: Lizhe.Wang@gmail.com (L. Wang), jinjun.chen@gmail.com (J. Chen).</p><p>http://dx.doi.org/10.1016/j.jcss.2014.02.006; 0022-0000/© 2014 Elsevier Inc. All rights reserved.</p><p>G-Hadoop [25] is a MapReduce implementation targeting a distributed system with multiple clusters, such as Grid infrastructures [9,12,22,26,20], a Cloud [13,27,24,21], distributed virtual machines [23], or a multi-data-centre infrastructure [28,17]. In order to share data across multiple administrative domains, G-Hadoop replaces Hadoop's native distributed file system with the Gfarm Grid file system [19]. MapReduce applications in G-Hadoop are scheduled across multiple clusters using a hierarchical scheduling approach. The MapReduce tasks are first scheduled among the clusters using Hadoop's data-aware scheduling policy and then among compute nodes using the existing cluster scheduler on the target clusters. G-Hadoop maintains the master/slave model of Hadoop, where the slave nodes are simple workers, while the master node accepts the jobs submitted by the user, splits them into smaller tasks, and finally distributes the tasks to the slave nodes.</p><p>The current G-Hadoop system reuses the Hadoop mechanisms for user authentication and job submission, which are actually designed for single-cluster environments. The mechanism applies the Secure Shell (SSH) protocol to establish a secure connection between a user and the target cluster. This kind of mechanism is not suitable for a distributed environment, like a Grid, that contains several large-scale clusters. In G-Hadoop, for example, an individual SSH connection has to be built between the user and each single cluster. This approach does not handle the system as a whole; rather it works separately with the system's components. In addition, with the Hadoop security approach a user must log on to each cluster in order to be authenticated before being capable of using its resources for running MapReduce tasks. This is undoubtedly a tedious task for the users. 
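As context for the Map and Reduce semantics mentioned above, the following minimal Python sketch (illustrative only; it is not Hadoop's or G-Hadoop's API, and all function names are our own) shows how a word count can be expressed with a map function and a reduce function, with the grouping step that a MapReduce runtime would perform automatically between the two phases:

```python
from collections import defaultdict

def map_func(line):
    # Map phase: emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def reduce_func(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return word, sum(counts)

def run_mapreduce(lines):
    # The runtime's job: apply map, group values by key, then apply reduce.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_func(line):
            groups[key].append(value)
    return dict(reduce_func(k, v) for k, v in groups.items())

result = run_mapreduce(["big data", "big clusters"])
# result == {"big": 2, "data": 1, "clusters": 1}
```

In a real Hadoop or G-Hadoop deployment the grouping loop above is replaced by the distributed shuffle between TaskTrackers, which is exactly the parallelism the framework automates for the developer.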
Therefore, a more general, system-wide solution is required to hide the architecture details of the system as well as to free the user from this burden. In this work, we designed a new security model for the G-Hadoop framework to meet the above challenges. The security framework is designed with the following properties:</p><p>• A single-sign-on process with a user name and password: A user logs on to G-Hadoop once, simply with his user name and password. Thereafter, all resources of the underlying system are available to the user and can be accessed without additional user-driven security processes. The procedure of being authenticated by the different clusters of the underlying system is performed automatically by the security framework in the background.</p><p>• Privacy of user information: The user information, such as authentication information, is invisible at the side of the slave nodes. Slave nodes take tasks, which are assigned by the master node, without being aware of user information, including the user name and password.</p><p>• Access control: The security framework protects the resources of the whole system from misuse or abuse by non-professional users. A user of the system only has the right to access the resources of a cluster that he can access with an SSH connection.</p><p>• Scalability: A cluster can easily be integrated into or removed from the execution environment without any change of the code on the slave nodes or any modification of the security framework itself.</p><p>• Immutability: The security framework does not change the existing security mechanisms of the clusters. Users of a cluster can still rely on their own authentication profiles to get access to the clusters.</p><p>• Protection against attacks: The security framework protects the system from different common attacks and guarantees privacy and security when exchanging sensitive information, such as authentication and encryption information. 
It is also capable of detecting a fraudulent party with a fake identity, in order to prevent an attacker from abusing or illegally accessing the resources.</p><p>The remainder of the paper is organized as follows. Section 2 gives a short introduction to the G-Hadoop architecture. Section 3 describes the designed security framework in detail. Section 4 demonstrates how the proposed security framework guarantees the security of the underlying system and analyzes its security performance against common attacks. An initial implementation of the framework is described in Section 5. Finally, the paper concludes in Section 6 with a brief summary and several future directions.</p><p>2. The G-Hadoop architecture</p><p>G-Hadoop maintains Hadoop's MapReduce architecture with a master/slave communication model. The JobTracker is the master server in the MapReduce framework and represents a central service that is responsible for managing the control flow of running MapReduce jobs. The JobTracker receives new jobs from its clients and splits the jobs into smaller tasks. A built-in scheduler dispatches the individual Map and Reduce tasks, and coordinates the order of the map phase and the reduce phase. The JobTracker is also responsible for monitoring the health status of all running tasks, detecting failures, and re-scheduling the failing tasks until all tasks of a job are finished.</p><p>TaskTrackers are the workhorses of the JobTracker. Each TaskTracker has a specific number of slots denoting how many tasks the TaskTracker is configured to run at the same time. The TaskTrackers periodically report their health status to the JobTracker using a heartbeat message. This message includes the current progress of all running tasks and the number of idle slots available on the TaskTracker node. In response, the JobTracker sends new instructions to the TaskTracker. These instructions include the assignment of a new task or the termination of a running task. To sandbox each running task, the TaskTracker runs a separate JVM per slot. It is the TaskTracker's responsibility to monitor the health status of the active tasks, manage their log files, detect failures, and report back to the JobTracker.</p><p>Fig. 1. G-Hadoop software architecture.</p><p>Based on the Hadoop MapReduce implementation, we developed the G-Hadoop framework for MapReduce executions in a wider area. The goal of G-Hadoop is to enable large-scale distributed computing across multiple clusters. To share data sets across multiple administrative domains, G-Hadoop replaces HDFS, Hadoop's native distributed file system, with the Gfarm file system.</p><p>Fig. 1 shows an overview of the G-Hadoop high-level architecture and its basic components. As mentioned, the proposed architecture of G-Hadoop follows a master/slave communication model. The master node is the central entity in the G-Hadoop architecture. It is responsible for accepting jobs submitted by the user, splitting the jobs into smaller tasks, and distributing these tasks among its slave nodes. It is also responsible for managing the metadata of all files available in the system. The master node contains a metadata server that manages files distributed among multiple clusters, and a JobTracker that is responsible for splitting jobs into smaller tasks and scheduling these tasks among the participating clusters for execution. Tasks are submitted to the queue of the cluster scheduler (e.g. Torque) using a standard interface. The slave node of G-Hadoop enables the execution of MapReduce tasks on the computing nodes of the participating clusters. Its main component is a TaskTracker that is in charge of accepting and executing tasks from the JobTracker of the master node.</p><p>3. The security framework</p><p>The G-Hadoop prototype currently uses the Secure Shell (SSH) protocol to establish a secure connection between a user and the target cluster. This approach requires a separate connection to each participating cluster, and users have to log on to each cluster to be authenticated. Therefore, we designed a novel security framework for G-Hadoop.</p><p>3.1. The concept</p><p>There are several successful security solutions, including the SSL protocol [14], the Globus Security Infrastructure [5] and the Java security solutions. The cryptographic protocol Secure Sockets Layer (SSL) and its successor Transport Layer Security (TLS) are widely used to provide secure communication over public networks. It takes two phases to set up an SSL connection: the first phase uses digital signatures and asymmetric cryptography for authentication, while the second phase is for data transmission. The Globus Security Infrastructure (GSI) is a standard for Grid security and provides a single-sign-on process in the Grid environment as well as authenticated communication, using asymmetric cryptography as the base for its functionality. As a security standard, GSI also meets different security requirements. The Java platform not only supports cross-platform capabilities, but also provides a variety of security features and solutions. Most of these security features and solutions are encapsulated in three standard Java extensions: Java Cryptography Extension (JCE), Java Secure Sockets Extension (JSSE), and Java Authentication and Authorization Service (JAAS). 
Each of these extensions contributes certain technologies to the security features and solutions.</p><p>Our security model follows the authentication concept of the Globus Security Infrastructure (GSI) [2,5], while using SSL for the communication between the master node and the CA server. GSI is a standard for Grid security. It provides a single-sign-on process and authenticated communication by using asymmetric cryptography as the base for its functionality. As a security standard, GSI adopts several techniques to meet different security requirements. These include authentication mechanisms for all entities in a Grid, integrity of the messages that are sent within a Grid, and delegation of authority from one entity to another. A certificate, which contains the identity of users or services, is the key of the GSI authentication approach. In this work, we use a similar certificate for the authentication between the master node and the slave nodes of G-Hadoop, where the master node is in charge of the single-sign-on process. The user needs only to provide his user name and password to log on to the master node; jobs can then be submitted to the clusters without requesting any other resources. A secure connection between the master node and the slave nodes is established by the security framework using a mechanism that imitates the SSL handshaking phase.</p><p>3.2. Applied terms</p><p>The following terms are used in the description of the novel security model:</p><p>• User instance: A user instance is an object that is created by the master node for a user who has active jobs, i.e. jobs that are initialized or in execution. The user instance is deleted by the master node when the user no longer has active jobs. A user instance is an important but dynamic element of the security framework. It is related to other information and data of the corresponding user as well as to the authentication process on the slave nodes. User instances are independent between any two different users. 
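The lifecycle just described can be sketched as a small registry held by the master node. This is a hypothetical illustration of the rules in the text (create on the first active job, delete when no active jobs remain); the class and method names are ours, not taken from the G-Hadoop code:

```python
class UserInstanceRegistry:
    """Tracks at most one user instance per user with active jobs."""

    def __init__(self):
        self._instances = {}  # user name -> set of active job ids

    def submit_job(self, user, job_id):
        # First active job creates the user instance; later jobs
        # of the same user attach to the existing one.
        self._instances.setdefault(user, set()).add(job_id)

    def finish_job(self, user, job_id):
        # Delete the user instance once no active jobs remain.
        jobs = self._instances.get(user)
        if jobs is not None:
            jobs.discard(job_id)
            if not jobs:
                del self._instances[user]

    def has_instance(self, user):
        return user in self._instances

registry = UserInstanceRegistry()
registry.submit_job("alice", 1)
registry.submit_job("alice", 2)   # second job reuses alice's instance
registry.finish_job("alice", 1)
print(registry.has_instance("alice"))   # True: job 2 is still active
```

Instances for different users are kept under separate keys, mirroring the paper's statement that user instances are independent between any two users.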
Another property of the user instance is sharing: if a user has more than one active job, these jobs share a common user instance. In case the user is logged out but there are still active jobs for him, the user instance will not be deleted, thu...</p></li></ul>
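To make the handshake-style authentication from Section 3.1 concrete, here is a deliberately tiny challenge–response sketch using textbook RSA with toy parameters. It is insecure and illustrative only (the real framework would rely on proper certificates and an SSL/TLS implementation); it merely shows the asymmetric-cryptography idea: the slave node sends a nonce, the master signs it with its private key, and the slave verifies the signature with the public key taken from the master's certificate:

```python
import hashlib

# Toy RSA parameters (p = 61, q = 53): far too small for real use.
N, E, D = 3233, 17, 2753  # modulus, public exponent, private exponent

def sign(message: bytes) -> int:
    # Master node: sign a digest of the challenge with the private key D.
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % N
    return pow(digest, D, N)

def verify(message: bytes, signature: int) -> bool:
    # Slave node: verify with the public key (N, E) from the certificate.
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big") % N
    return pow(signature, E, N) == digest

challenge = b"nonce-42"                  # random nonce sent by the slave
sig = sign(challenge)
print(verify(challenge, sig))            # True for the genuine master
print(verify(challenge, (sig + 1) % N))  # False for a forged signature
```

Because only the holder of the private key can produce a valid signature over a fresh nonce, a fake master node (or a replayed old signature) fails verification, which is the property the framework's SSL-style handshake relies on.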

