32
synergy.cs.vt.edu Aeromancer: A Workflow Manager for Large- Scale MapReduce-Based Scientific Workflows Presented by Sarunya Pumma Supervisors: Dr. Wu-chun Feng, Dr. Mark Gardner, and Dr. Hao Wang

Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

synergy.cs.vt.edu

Aeromancer: A Workflow Manager for Large-Scale MapReduce-Based Scientific WorkflowsPresented by Sarunya Pumma Supervisors: Dr. Wu-chun Feng, Dr. Mark Gardner, and Dr. Hao Wang

Page 2: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Outline• Introduction & Motivation

• Cloud computing and its benefits

• SeqInCloud and scalability analysis

• Aeromancer

• Usage scenario

• Workflow and architecture

• Discussion

• Experiments & Initial Results

Page 3: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Introduction & Motivation

Page 4: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

What is Cloud?

Page 5: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

What is Cloud?

physical resources

CPUs Memory Storage Network

vCPUs vMemory vStorage vNetwork

virtual resources : resource pool

user a user b

I would like 6 virtual machines

for a month

I would like 3 virtual machines

for 2 months

Page 6: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

How is Cloud beneficial for Us?

Page 7: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

How is Cloud beneficial for Us?

Cost of sequencing a human-size genome decreases over time

2011 $95M 2012 $6.5k 2014 $1k

Next-generation sequencing (NGS) produces a huge amount

of DNA sequence data 2k sequencers produce 15 PB

genetic data/year{ } { }

Page 8: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

How is Cloud beneficial for Us?

Big Data

Page 9: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Cloud can accelerate our ability to identify the cure for cancer

How is Cloud beneficial for Us?

Page 10: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Provisioning compute and storage resources to analyze the exponentially increasing genomic data

Big Data Challenge

Page 11: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

SeqInCloud• It is implemented based on GATK (Genome Analysis

Toolkit)

• All six stages have to be executed in order

• Each stage is implemented as a small separated Hadoop program • It can be broken down into

small jobs and run on multiple computers/processors

• We expect a linear speed-up on every stage

Page 12: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

SeqInCloudGenome variant analysis pipeline

Align (BWA) .…..

Align (BWA)

Sort Sort .…..

Local-Realignment &

Sort

Local- Realignment &

Sort

...

MarkDuplicate MarkDuplicate …

...Count Covariate Count Covariate

Merge Covariates

Table Recalibration

Table Recalibration

...

Genotyper & Variant Filter

Genotyper & Variant Filter

...

Combine Variants Merge BAM File

FASTQ file

VCF file BAM file

Index Index .…..

1

2

3

4

5

6

Page 13: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

SeqInCloud Scalability AnalysisSp

eed

up

0

8

15

23

30

Stage 0 Stage 1 Stage 2 Stage 3 Stage 4 Stage 5 Stage 6

1 node 2 nodes 4 nodes 8 nodes 16 nodes 32 nodes

1

16

32

8

32

1 1

Page 14: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

SeqInCloud Scalability Analysis• To run SeqInCloud efficiently on Hadoop

• Number of compute nodes must be varied based on the scalability of each stage

• Challenges

1. How many nodes do we need for each stage?

2. How to rapidly adjust a number of nodes with negligible overhead?

An automatic Hadoop workflow execution is needed!

Page 15: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Hadoop Workflow Manager

Cloud

Hadoop Workflow

(SeqInCloud)

Page 16: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Hadoop Workflow

(SeqInCloud)

AeromancerCloud

Page 17: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer• Platform for executing the scientific Hadoop workflows

• Based on Hadoop

• MapReduce

• HDFS

• Build on top of CloudGene

• Workflow (DAG)

• Node/Stage • Edge (represent dependency

between two stages)

Page 18: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Usage Scenario

Add a new application to Aeromancer1

Page 19: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Usage Scenario

YAML file

Application details

Public cloud details

Execution details - Execution command - Stage dependencies - Data dependencies

Page 20: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Usage Scenario

Add a new application to Aeromancer1

Submit a new job2

Page 21: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Usage Scenario

User Portal

Page 22: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Usage Scenario

Add a new application to Aeromancer1

Submit a new job2Wait for the execution to

complete3

Page 23: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Cluster

Aeromancer Workflow & Architecture

Aeromancer Server

Aeromancer

Step Queue

Job Queue

DAG Processor

Worker Thread

Sub- Worker Thread

Spawn

Query

Enqueue

Spawn

Master

Hadoop

Templeton

Worker

Hadoop

Worker

Hadoop

Worker

Hadoop

Page 24: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Aeromancer Discussion• Currently, Aeromancer worker nodes are virtual machines in

the cloud• Pro

• Number of worker nodes are flexibly adjustable upon demand

• Con

• The overhead of creating/deleting worker nodes is high

• Execution speed drops significantly

Containerization may be able to improve Aeromancer’s performance

Page 25: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Virtualization Containerization

Reference: https://www.docker.com/whatisdocker

Page 26: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Experiment• Platforms

• Bare metal machines

• KVM

• Docker • Benchmarks (HiBench)

• Sort • WordCount • TeraSort

• Performance

• Speed (execution time)

• HDFS throughput

• Nutch Indexing

• PageRank

• EnhancedDFSIO

Page 27: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Experiment: Hadoop on Bare Metal Environment

ApplicationHistoryServer SecondaryNameNode

JobHistoryServer AmbariServer

NameNodeResourceManager

MiCloud2 (Master Machine)

DataNodeNodeManager

DataNodeNodeManager

MiCloud3 (Slave Machine)eth0

MiCloud5 (Slave Machine)

Page 28: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Experiment: Hadoop on Virtual Machine Environment

AmbariServer ApplicationHistoryServer SecondaryNameNode

JobHistoryServer

NameNodeResourceManager

Micloud2

MiCloud5 (Slave VM)

eth0br0

br0NodeManager

DataNode

NodeManagerDataNodebr0

MiCloud3 (Slave VM)

eth0: 192.168.4.x/24 br0: 192.168.254.x/24

(Master VM)

12 vCPUs 16 GB-vMemory

Page 29: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Experiment: Hadoop on Docker

AmbariServer ApplicationHistoryServer SecondaryNameNode

JobHistoryServer

NameNodeResourceManager

Micloud2

MiCloud5 (Slave Container)

eth0br0

br0NodeManager

DataNode

NodeManagerDataNodebr0

MiCloud3 (Slave Container)

eth0: 192.168.4.x/24 br0: 192.168.6.x/24

(Master Container)

12 vCPUs 16 GB-vMemory

Page 30: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Initial Results: Slow Down

0%

25%

50%

75%

100%

Sort Wordcount DFSIOE NutchIndexing TeraSort PageRank

94.8%

56.8%

97.4%

59.5%

98.6%95.4%

83%

9%

88%

14%

92%

62%

KVM Docker

Page 31: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

Future Work• Automatic resources provisioning

• Determining the number of computing nodes needed for each stage

• Based on the scalability of each stage

• Based on user’s constraints (budget/expected response time)

• Automatic task scheduling

• Determining the optimal execution location for each stage (either client or cloud)

Page 32: Aeromancer: A Workflow Manager for Large- Scale MapReduce ...research.cs.vt.edu/seec/2015-08-workshop/kwang-aeromancer-201… · 21/08/2015  · • Based on Hadoop • MapReduce

synergy.cs.vt.edu

Thank you

32