EXTENDING SCIENTIFIC WORKFLOW SYSTEMS TO SUPPORT MAPREDUCE-BASED APPLICATIONS IN THE CLOUD
Shashank Gugnani, Tamas Kiss
OUTLINE
• Introduction - motivations
• Approach
• Previous work
• Implementation
• Experiments and results
• Conclusion
INTRODUCTION
Hadoop
• Open-source implementation of the MapReduce framework introduced by Google in 2004
• MapReduce: to process large datasets in parallel and on thousands of nodes in a reliable and fault-tolerant manner
• Map: input data is divided into chunks and analysed on different nodes in parallel
• Reduce: collating the work and combining the results into a single value
• Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
• Originally for bare-metal clusters – popularity in cloud is growing
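The map and reduce steps described above can be sketched in miniature. This is a single-process Python illustration of the programming model only, not Hadoop's distributed, fault-tolerant implementation:

```python
from collections import defaultdict

def map_phase(chunks):
    """Map: each input chunk is processed independently,
    emitting a (word, 1) pair for every word."""
    for chunk in chunks:
        for word in chunk.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: collate the mapped pairs, combining the
    counts for each word into a single total."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Word count, the canonical MapReduce example.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(chunks))
# counts["the"] == 3, counts["fox"] == 2
```

In real Hadoop the chunks live in HDFS, mappers run on many nodes, and the framework shuffles the pairs to the reducers; the logical structure is the same.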
INTRODUCTION
Aim
• Integration of Hadoop with workflow systems and science gateways
• Automatic setup of Hadoop software and infrastructure
• Utilization of the power of Cloud Computing
Motivation
• Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) have been parallelized using the MapReduce framework
• Installing and configuring a Hadoop cluster is well beyond the capabilities of domain scientists
INTRODUCTION
Motivation
• CloudSME project
• To develop a cloud-based simulation platform for manufacturing and engineering
• Funded by the European Commission FP7 programme, FoF: Factories of the Future
• July 2013 – December 2015
• EUR 4.5 million overall funding
• Coordinated by the University of Westminster
• 29 project partners from 8 European countries
• 24 companies (SMEs) and 5 academic/research institutions
• Industrial use case: data mining of aircraft maintenance data using Hadoop-based parallelisation
APPROACH
• Set up a disposable cluster in the cloud, execute Hadoop job and destroy cluster
• Cluster related parameters and input files provided by user
• The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster, and executes the Hadoop job
• Two methods proposed: the Single Node Method and the Three Node Method
PREVIOUS WORK
• Hadoop portlet developed by BIFI within the SCI-BUS project
• Liferay-based portlet
• Submits Hadoop jobs to user-specified clusters in an OpenStack cloud
• User only needs to provide job executable and cluster configuration
• Easy to set up and use
• Front end based on Ajax web services
• Back end based on Java
• Standalone portlet, no integration with WS-PGRADE workflows
PREVIOUS WORK
• Workflow integration could be achieved directly using the Hadoop portlet
• Bash script to submit, monitor and retrieve jobs from the portlet:
(a) Submit job to MapReduce portlet
(b) Get JobId of submitted job from portlet
(c) Get job status and logs from portlet periodically until job is finished
(d) Get output of job if job is successful
• Requires additional portlet to be installed on gateway
• Adds communication overhead
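The submit-and-poll loop in steps (a)–(d) can be sketched as follows. The `get_status` callable stands in for the HTTP request (e.g. via curl) to the portlet's job-status endpoint; the endpoint and the status strings used here are assumptions, not the portlet's documented API:

```python
import time

def wait_for_job(get_status, poll_interval=1.0, max_polls=1000):
    """Poll the portlet periodically until the submitted job
    reaches a final state, then return that state."""
    for _ in range(max_polls):
        status = get_status()          # stand-in for the curl call
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish within the polling budget")

# Simulated portlet responses: the job is seen running twice, then succeeds.
responses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(lambda: next(responses), poll_interval=0.0)
```

Each poll is an extra round trip through the portlet, which is the communication overhead noted above.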
[Architecture diagram: a shell script uses curl to drive the Hadoop portlet, which manages the OpenStack cloud through the OpenStack Java API]
SINGLE NODE METHOD
Working:
• Connect to the OpenStack cloud and launch servers
• Connect to the master node server and set up the cluster configuration
• Transfer input files and job executable to master node
• Start the Hadoop job by running a script on the master node
• When the job is finished, delete servers from OpenStack cloud and retrieve output if the job is successful
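Putting the steps above together, the single workflow node's executable can be sketched as one orchestration function. The `cloud` helpers (launch, configure, upload, run, download, destroy) are hypothetical names for the OpenStack and SSH operations the node performs, not a real API:

```python
def run_single_node_job(cloud, job):
    """Single Node Method: one workflow node launches the cluster,
    runs the Hadoop job, and always tears the cluster down."""
    servers = cloud["launch_servers"]()        # OpenStack: create VMs
    try:
        cloud["configure_cluster"](servers)    # set up Hadoop on the nodes
        cloud["upload"](servers, job)          # stage inputs + job executable
        ok = cloud["run_job"](servers, job)    # start the job on the master
        if ok:
            cloud["download_output"](servers)  # retrieve output on success
        return ok
    finally:
        cloud["destroy_servers"](servers)      # delete the VMs in every case

# Stub cloud that records the order of operations.
calls = []
cloud = {
    "launch_servers": lambda: calls.append("launch") or ["master", "worker"],
    "configure_cluster": lambda s: calls.append("configure"),
    "upload": lambda s, j: calls.append("upload"),
    "run_job": lambda s, j: calls.append("run") or True,
    "download_output": lambda s: calls.append("download"),
    "destroy_servers": lambda s: calls.append("destroy"),
}
ok = run_single_node_job(cloud, {"executable": "job.jar"})
```

The `try/finally` captures the method's defining property: the cluster is disposable and is destroyed whether or not the job succeeds.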
THREE NODE METHOD
Working:
• Stage 1 or Deploy Hadoop Node: launch servers in the OpenStack cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
• Stage 2 or Execute Node: upload input files and job executable to the master node, execute the job and retrieve the results
• Stage 3 or Destroy Hadoop Node: destroy the cluster to free up resources
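The three stages can be sketched as three functions that hand the saved cluster configuration from one to the next; the helper callables are hypothetical placeholders for the real OpenStack/SSH operations. The payoff over the Single Node Method is that stage 2 can run many jobs against one cluster:

```python
def deploy_hadoop(launch, configure):
    """Stage 1: launch servers, set up Hadoop, save the cluster config."""
    servers = launch()
    configure(servers)
    return {"servers": servers}     # config handed to the later stages

def execute_job(config, upload, run, download):
    """Stage 2: stage inputs, run a Hadoop job, fetch the results.
    May be invoked repeatedly against the same cluster."""
    servers = config["servers"]
    upload(servers)
    ok = run(servers)
    if ok:
        download(servers)
    return ok

def destroy_hadoop(config, destroy):
    """Stage 3: tear the cluster down to free cloud resources."""
    destroy(config["servers"])

# Deploy once, run three jobs, destroy once.
config = deploy_hadoop(lambda: ["master", "worker1", "worker2"], lambda s: None)
results = [execute_job(config, lambda s: None, lambda s: True, lambda s: None)
           for _ in range(3)]
destroy_hadoop(config, lambda s: None)
```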
IMPLEMENTATION – CLOUDBROKER PLATFORM
[Architecture diagram: end users, software vendors and resource providers reach the CloudBroker Platform through user tools (web browser UI, Java client library, CLI) and its REST web service API; chemistry, biology, pharma, engineering and other applications are deployed on Amazon, CloudSigma, OpenStack, OpenNebula, Eucalyptus and other clouds]
Seamless access to heterogeneous cloud resources – high level interoperability
IMPLEMENTATION – WS-PGRADE/GUSE
• General purpose, workflow-oriented gateway framework
• Supports the development and execution of workflow-based applications
• Enables the multi-cloud and multi-grid execution of any workflow
• Supports the fast development of gateway instances by a customization technology
IMPLEMENTATION – SHIWA WORKFLOW REPOSITORY
• Workflow repository to store directly executable workflows
• Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy etc.
• Fully integrated with WS-PGRADE/gUSE
IMPLEMENTATION – SUPPORTED STORAGE SOLUTIONS
Local (user’s machine):
• Bottleneck for large files
• Multiple file transfers: local machine – WS-PGRADE – CloudBroker – Bootstrap node – Master node – HDFS
Swift:
• Two file transfers: Swift – Master node – HDFS
Amazon S3:
• Direct transfer from S3 to HDFS using Hadoop's distributed copy (distcp) application
Input/output locations can be mixed and matched in one workflow
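The practical difference between the three options is how many hops the data makes before it reaches HDFS, which can be tabulated as below. This is illustrative bookkeeping only; the actual S3 transfer would use `hadoop distcp`, and the exact endpoints and URL schemes are deployment-specific:

```python
def hops_to_hdfs(storage):
    """Intermediate destinations data passes through on its way to HDFS,
    for each of the supported input storage options."""
    plans = {
        # local machine -> WS-PGRADE -> CloudBroker -> bootstrap -> master -> HDFS
        "local": ["WS-PGRADE", "CloudBroker", "bootstrap node",
                  "master node", "HDFS"],
        # Swift -> master node -> HDFS
        "swift": ["master node", "HDFS"],
        # S3 -> HDFS directly, via Hadoop's distributed copy (distcp)
        "s3": ["HDFS"],
    }
    return plans[storage]

# Fewer hops means less transfer overhead for large input files.
assert len(hops_to_hdfs("local")) > len(hops_to_hdfs("swift")) > len(hops_to_hdfs("s3"))
```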
EXPERIMENTS AND RESULTS
Testbed
• CloudSME production gUSE (v3.6.6) portal
• Jobs submitted using the CloudSME CloudBroker platform
• All jobs submitted to University of Westminster OpenStack Cloud
• Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers
EXPERIMENTS AND RESULTS
Hadoop applications used for experiments
• WordCount – the standard Hadoop example
• Rule-Based Classification – a classification algorithm adapted for MapReduce
• PrefixSpan – a MapReduce version of the popular sequential pattern mining algorithm
EXPERIMENTS AND RESULTS
• Single node: Hadoop cluster created and destroyed multiple times
• Three node: multiple Hadoop jobs between a single pair of create and destroy nodes
EXPERIMENTS AND RESULTS
[Chart: 5 jobs on a 5-node cluster each, using the WS-PGRADE parameter sweep feature – single node method]
[Chart: single Hadoop jobs on a 5-node cluster – single node method]
CONCLUSION
• Solution works for any Hadoop application
• Proposed approach is generic and can be used for any gateway environment and cloud
• Users can choose the appropriate method (Single or Three Node) according to their application
• Parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
• Can be used for large scale scientific simulations