EXTENDING SCIENTIFIC WORKFLOW SYSTEMS TO SUPPORT MAPREDUCE-BASED APPLICATIONS IN THE CLOUD
Shashank Gugnani, Tamas Kiss
OUTLINE
• Introduction - motivations
• Approach
• Previous work
• Implementation
• Experiments and results
• Conclusion
INTRODUCTION
Hadoop
• Open-source implementation of the MapReduce framework introduced by Google in 2004
• MapReduce: to process large datasets in parallel and on thousands of nodes in a reliable and fault-tolerant manner
• Map: input data is divided into chunks and analysed on different nodes in parallel
• Reduce: collating the work and combining the results into a single value
• Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
• Originally for bare-metal clusters – popularity in cloud is growing
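The map and reduce steps described above can be sketched in miniature. This is a single-process Python illustration of the programming model only, not Hadoop's distributed, fault-tolerant implementation:

```python
from collections import defaultdict

def map_phase(chunks):
    """Map: each input chunk is processed independently,
    emitting a (word, 1) pair for every word."""
    for chunk in chunks:
        for word in chunk.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: collate the mapped pairs, combining the
    counts for each word into a single total."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Word count, the canonical MapReduce example.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(chunks))
# counts["the"] == 3, counts["fox"] == 2
```

In real Hadoop the chunks live in HDFS, mappers run on many nodes, and the framework shuffles the pairs to the reducers; the logical structure is the same.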
INTRODUCTION
Aim
• Integration of Hadoop with workflow systems and science gateways
• Automatic setup of Hadoop software and infrastructure
• Utilization of the power of Cloud Computing
Motivation
• Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) have been parallelized using the MapReduce framework
• Installing and configuring a Hadoop cluster is well beyond the capabilities of domain scientists
INTRODUCTION
Motivation
• CloudSME project
• To develop a cloud-based simulation platform for manufacturing and engineering
• Funded by the European Commission FP7 programme, FoF: Factories of the Future
• July 2013 – December 2015
• EUR 4.5 million overall funding
• Coordinated by the University of Westminster
• 29 project partners from 8 European countries
• 24 companies (SMEs) and 5 academic/research institutions
• Industrial use case: data mining of aircraft maintenance data using Hadoop-based parallelisation
APPROACH
• Set up a disposable cluster in the cloud, execute Hadoop job and destroy cluster
• Cluster related parameters and input files provided by user
• The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster, and executes the Hadoop job
• Two methods proposed: the Single Node Method and the Three Node Method
PREVIOUS WORK
• Hadoop portlet developed by BIFI within the SCI-BUS project
• Liferay-based portlet
• Submits Hadoop jobs to user-specified clusters in an OpenStack cloud
• User only needs to provide job executable and cluster configuration
• Easy to set up and use
• Front end based on Ajax web services
• Back end based on Java
• Standalone portlet, no integration with WS-PGRADE workflows
PREVIOUS WORK
• Workflow integration could be achieved directly using the Hadoop portlet
• Bash script to submit, monitor and retrieve jobs from the portlet:
(a) Submit job to MapReduce portlet
(b) Get JobId of submitted job from portlet
(c) Get job status and logs from portlet periodically until job is finished
(d) Get output of job if job is successful
• Requires additional portlet to be installed on gateway
• Adds communication overhead
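The submit-and-poll loop in steps (a)–(d) can be sketched as follows. The `get_status` callable stands in for the HTTP request (e.g. via curl) to the portlet's job-status endpoint; the endpoint and the status strings used here are assumptions, not the portlet's documented API:

```python
import time

def wait_for_job(get_status, poll_interval=1.0, max_polls=1000):
    """Poll the portlet periodically until the submitted job
    reaches a final state, then return that state."""
    for _ in range(max_polls):
        status = get_status()          # stand-in for the curl call
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish within the polling budget")

# Simulated portlet responses: the job is seen running twice, then succeeds.
responses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(lambda: next(responses), poll_interval=0.0)
```

Each poll is an extra round trip through the portlet, which is the communication overhead noted above.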
[Architecture diagram: a shell script uses curl to drive the Hadoop portlet, which manages the OpenStack cloud through the OpenStack Java API]
SINGLE NODE METHOD
Working:
• Connect to the OpenStack cloud and launch servers
• Connect to the master node server and set up the cluster configuration
• Transfer input files and job executable to master node
• Start the Hadoop job by running a script on the master node
• When the job is finished, delete servers from OpenStack cloud and retrieve output if the job is successful
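Putting the steps above together, the single workflow node's executable can be sketched as one orchestration function. The `cloud` helpers (launch, configure, upload, run, download, destroy) are hypothetical names for the OpenStack and SSH operations the node performs, not a real API:

```python
def run_single_node_job(cloud, job):
    """Single Node Method: one workflow node launches the cluster,
    runs the Hadoop job, and always tears the cluster down."""
    servers = cloud["launch_servers"]()        # OpenStack: create VMs
    try:
        cloud["configure_cluster"](servers)    # set up Hadoop on the nodes
        cloud["upload"](servers, job)          # stage inputs + job executable
        ok = cloud["run_job"](servers, job)    # start the job on the master
        if ok:
            cloud["download_output"](servers)  # retrieve output on success
        return ok
    finally:
        cloud["destroy_servers"](servers)      # delete the VMs in every case

# Stub cloud that records the order of operations.
calls = []
cloud = {
    "launch_servers": lambda: calls.append("launch") or ["master", "worker"],
    "configure_cluster": lambda s: calls.append("configure"),
    "upload": lambda s, j: calls.append("upload"),
    "run_job": lambda s, j: calls.append("run") or True,
    "download_output": lambda s: calls.append("download"),
    "destroy_servers": lambda s: calls.append("destroy"),
}
ok = run_single_node_job(cloud, {"executable": "job.jar"})
```

The `try/finally` captures the method's defining property: the cluster is disposable and is destroyed whether or not the job succeeds.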
THREE NODE METHOD
Working:
• Stage 1 or Deploy Hadoop Node: launch servers in the OpenStack cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration
• Stage 2 or Execute Node: upload input files and job executable to the master node, execute the job and retrieve the results
• Stage 3 or Destroy Hadoop Node: destroy the cluster to free up resources
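The three stages can be sketched as three functions that hand the saved cluster configuration from one to the next; the helper callables are hypothetical placeholders for the real OpenStack/SSH operations. The payoff over the Single Node Method is that stage 2 can run many jobs against one cluster:

```python
def deploy_hadoop(launch, configure):
    """Stage 1: launch servers, set up Hadoop, save the cluster config."""
    servers = launch()
    configure(servers)
    return {"servers": servers}     # config handed to the later stages

def execute_job(config, upload, run, download):
    """Stage 2: stage inputs, run a Hadoop job, fetch the results.
    May be invoked repeatedly against the same cluster."""
    servers = config["servers"]
    upload(servers)
    ok = run(servers)
    if ok:
        download(servers)
    return ok

def destroy_hadoop(config, destroy):
    """Stage 3: tear the cluster down to free cloud resources."""
    destroy(config["servers"])

# Deploy once, run three jobs, destroy once.
config = deploy_hadoop(lambda: ["master", "worker1", "worker2"], lambda s: None)
results = [execute_job(config, lambda s: None, lambda s: True, lambda s: None)
           for _ in range(3)]
destroy_hadoop(config, lambda s: None)
```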
IMPLEMENTATION – CLOUDBROKER PLATFORM
[Architecture diagram: end users, software vendors and resource providers reach the CloudBroker Platform through user tools (web browser UI, Java client library, CLI) and its REST web service API; chemistry, biology, pharma, engineering and other applications are deployed on Amazon, CloudSigma, OpenStack, OpenNebula, Eucalyptus and other clouds]
Seamless access to heterogeneous cloud resources – high level interoperability
IMPLEMENTATION – WS-PGRADE/GUSE
• General purpose, workflow-oriented gateway framework
• Supports the development and execution of workflow-based applications
• Enables the multi-cloud and multi-grid execution of any workflow
• Supports the fast development of gateway instances by a customization technology
IMPLEMENTATION – SHIWA WORKFLOW REPOSITORY
• Workflow repository to store directly executable workflows
• Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy etc.
• Fully integrated with WS-PGRADE/gUSE
IMPLEMENTATION – SUPPORTED STORAGE SOLUTIONS
Local (user’s machine):
• Bottleneck for large files
• Multiple file transfers: local machine – WS-PGRADE – CloudBroker – Bootstrap node – Master node – HDFS
Swift:
• Two file transfers: Swift – Master node – HDFS
Amazon S3:
• Direct transfer from S3 to HDFS using Hadoop's distributed copy (distcp) application
Input/output locations can be mixed and matched in one workflow
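The practical difference between the three options is how many hops the data makes before it reaches HDFS, which can be tabulated as below. This is illustrative bookkeeping only; the actual S3 transfer would use `hadoop distcp`, and the exact endpoints and URL schemes are deployment-specific:

```python
def hops_to_hdfs(storage):
    """Intermediate destinations data passes through on its way to HDFS,
    for each of the supported input storage options."""
    plans = {
        # local machine -> WS-PGRADE -> CloudBroker -> bootstrap -> master -> HDFS
        "local": ["WS-PGRADE", "CloudBroker", "bootstrap node",
                  "master node", "HDFS"],
        # Swift -> master node -> HDFS
        "swift": ["master node", "HDFS"],
        # S3 -> HDFS directly, via Hadoop's distributed copy (distcp)
        "s3": ["HDFS"],
    }
    return plans[storage]

# Fewer hops means less transfer overhead for large input files.
assert len(hops_to_hdfs("local")) > len(hops_to_hdfs("swift")) > len(hops_to_hdfs("s3"))
```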
EXPERIMENTS AND RESULTS
Testbed
• CloudSME production gUSE (v3.6.6) portal
• Jobs submitted using the CloudSME CloudBroker platform
• All jobs submitted to University of Westminster OpenStack Cloud
• Hadoop v2.5.1 on Ubuntu 14.04 (Trusty) servers
EXPERIMENTS AND RESULTS
Hadoop applications used for experiments
• WordCount – the standard Hadoop example
• Rule-Based Classification – a classification algorithm adapted for MapReduce
• PrefixSpan – a MapReduce version of the popular sequential pattern mining algorithm
EXPERIMENTS AND RESULTS
• Single node: Hadoop cluster created and destroyed multiple times
• Three node: multiple Hadoop jobs between a single pair of create and destroy nodes
EXPERIMENTS AND RESULTS
[Chart: 5 jobs on a 5-node cluster each, using the WS-PGRADE parameter sweep feature – single node method]
[Chart: single Hadoop jobs on a 5-node cluster – single node method]
CONCLUSION
• Solution works for any Hadoop application
• Proposed approach is generic and can be used for any gateway environment and cloud
• Users can choose the appropriate method (Single or Three Node) according to their application
• Parameter sweep feature of WS-PGRADE can be used to run Hadoop jobs with multiple input datasets simultaneously
• Can be used for large scale scientific simulations