6th June, 2007 WWWFG 2007 Wildfire Distributed, Grid-Enabled Workflow Construction and Execution Arun Krishnan, PhD Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan

Page 1: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

6th June, 2007 WWWFG 2007

Wildfire
Distributed, Grid-Enabled Workflow Construction and Execution

Arun Krishnan, PhD
Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan

Page 2: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Two Trends

● Bioinformatics analysis
  ● Increasingly complex analyses
  ● Several bioinformatics applications assembled into workflows

● Affordable HPC
  ● Commodity hardware assembled into Beowulf clusters
  ● Pooled hardware in Grids
  ● Parallel by design

Traditional solution: implement workflows as Perl scripts.

Difficult to program.

Difficult to maintain.

Difficult to port.

Page 3: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

The Problem

● We need
  ● Tool for construction and execution of workflows on supercomputers
  ● User interface must be intuitive for non-HPC specialists
  ● Execution must support different supercomputing platforms

Page 4: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Solution

● Objectives:
  ● Coarse-grained parallel programming for Grid
  ● Exploit heterogeneity of Grid (s/w licences, data, h/w)

● Approach:
  ● An expressive workflow description language, GEL
    ● Sequential and parallel composition
    ● Conditional execution (if-then-else)
    ● Sequential iteration (while loop)
    ● Parameterised parallel composition (parameter sweeps)
    ● Parameterised sequential composition

Basic idea: can we do on the Grid what we do using shell scripts on a cluster?

for i in `ls .`; do blastp -d yeast -I $i; done

Page 5: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics

● A workflow has
  ● One input directory
  ● One or more output directories

● Workflow cannot modify its input directory

Page 6: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (Job)

● Job (atomic workflow subunit)
  ● Characteristics
    ● Executable name
    ● Resource/system/software/data requirements
    ● One input directory
    ● One output directory
  ● Semantics
    ● Stage files into input directory
    ● Run executable
    ● Present output directory as result
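The three-step job semantics (stage, run, present output) can be sketched in Python. This is only an illustration under stated assumptions, not how the GEL interpreters are written: the function `run_job`, the convention of passing both directories as command-line arguments, and the upper-casing executable are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(argv, staged_files):
    """Run one job: stage inputs, execute, return the output directory."""
    work = Path(tempfile.mkdtemp(prefix="job_"))
    in_dir, out_dir = work / "in", work / "out"
    in_dir.mkdir()
    out_dir.mkdir()
    # 1. Stage files into the input directory.
    for name, data in staged_files.items():
        (in_dir / name).write_bytes(data)
    # 2. Run the executable (assumed convention: it receives both paths).
    subprocess.run(argv + [str(in_dir), str(out_dir)], check=True)
    # 3. Present the output directory as the result.
    return out_dir


# Hypothetical executable: upper-case every staged file.
UPPER = (
    "import sys, pathlib\n"
    "src, dst = map(pathlib.Path, sys.argv[1:3])\n"
    "for f in src.iterdir():\n"
    "    (dst / f.name).write_text(f.read_text().upper())\n"
)

result = run_job([sys.executable, "-c", UPPER], {"seq.txt": b"acgt"})
```

The executable only writes into the output directory, which mirrors the rule that a workflow cannot modify its input directory.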

Page 7: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (Conditional)

● Conditional (if E then A else B)
  ● E is a job for which we ignore the output files
  ● A and B are workflows
  ● Executing (if E then A else B) entails
    ● Execute E and observe stdout
    ● If stdout is non-empty then execute A
    ● If stdout is empty then execute B
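A minimal Python sketch of the stdout-based test, assuming the guard E is an ordinary command line. The helper names `guard_is_true` and `if_then_else` and the two guard commands are made up for illustration; only the rule "non-empty stdout means true" comes from the slide.

```python
import subprocess
import sys


def guard_is_true(argv):
    """Run guard job E and report whether it wrote anything to stdout."""
    done = subprocess.run(argv, capture_output=True, text=True)
    return done.stdout != ""


def if_then_else(guard_argv, then_branch, else_branch):
    """Execute then_branch when the guard printed something, else else_branch."""
    return then_branch() if guard_is_true(guard_argv) else else_branch()


# Hypothetical guards built from the running interpreter.
prints_something = [sys.executable, "-c", "print('hit found')"]
prints_nothing = [sys.executable, "-c", "pass"]

taken_a = if_then_else(prints_something, lambda: "A", lambda: "B")
taken_b = if_then_else(prints_nothing, lambda: "A", lambda: "B")
```

With this convention a filter such as grep can serve as a guard: any matching line makes the condition true.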

Page 8: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (seq, par compn)

● Sequential composition (A;B)
  ● Execute A
  ● Copy files from all output directories of A into input directory of B
  ● Execute B
  ● Note: implicit merge of output directories of A

● Parallel composition (A||B)
  ● Execute A and B from input directories populated from the same files
  ● Output directories are those of A and B
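The two composition rules can be modelled in a few lines of Python. The model is an assumption made for illustration: a directory is a dict of filename to contents, and a workflow is a function from one input directory to a list of output directories. The names `job`, `seq` and `par` are hypothetical, and the behaviour on filename clashes during the merge (later file wins) is a choice of this sketch, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor

# Model (an assumption for illustration): a directory is a dict of
# {filename: contents}; a workflow maps one input directory to a list
# of output directories.


def job(fn):
    """Lift a function on one directory into a workflow with one output."""
    return lambda in_dir: [fn(dict(in_dir))]  # copy: input is never modified


def seq(a, b):
    """A;B: merge all output directories of A into the input of B."""
    def run(in_dir):
        merged = {}
        for out_dir in a(in_dir):
            merged.update(out_dir)
        return b(merged)
    return run


def par(a, b):
    """A||B: run both on the same input files; outputs are those of A and B."""
    def run(in_dir):
        with ThreadPoolExecutor(max_workers=2) as pool:
            fa, fb = pool.submit(a, in_dir), pool.submit(b, in_dir)
            return fa.result() + fb.result()
    return run


count = job(lambda d: {"count.txt": str(len(d))})
names = job(lambda d: {"names.txt": ",".join(sorted(d))})
report = job(lambda d: {"report.txt": d["count.txt"] + ":" + d["names.txt"]})

outputs = seq(par(count, names), report)({"a.fa": "AC", "b.fa": "GT"})
```

Here `par(count, names)` yields two output directories, and `seq` merges them implicitly before `report` runs, which is the "implicit merge" noted on the slide.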

Page 9: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (Sequential Iteration)

● Sequential iteration (while E do A)
  ● E is a job for which we ignore the output files
  ● (Standard while defn) Executing

    while E do A

    is semantically equivalent to executing

    if E then (A; while E do A)
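The unfolding rule can be checked against an ordinary loop in Python. This sketch simplifies the guard to a function on a value rather than a job whose stdout is inspected; the guard and body used here are hypothetical.

```python
def while_loop(guard, body, state):
    """Ordinary loop: run body for as long as the guard holds."""
    while guard(state):
        state = body(state)
    return state


def while_unfolded(guard, body, state):
    """The unfolding: if E then (A; while E do A), otherwise do nothing."""
    if guard(state):
        return while_unfolded(guard, body, body(state))
    return state


# Hypothetical guard and body: keep halving until the value drops below 10.
above_ten = lambda n: n >= 10
halve = lambda n: n // 2

looped = while_loop(above_ten, halve, 100)
unfolded = while_unfolded(above_ten, halve, 100)
```

Both functions visit 100, 50, 25, 12 and stop at 6, so the recursive definition and the loop agree.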

Page 10: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (Parameterised par compn)

● Parameterised parallel composition (pfor x in xs do A(x))
  ● xs is a list expression
    ● E.g. 0:50:10 = [0, 10, 20, 30, 40, 50]
  ● Variable x is a bound variable
  ● Executing

    pfor x in (a0,xs) do A(x)

    is semantically equivalent to executing

    A(a0) || (pfor x in xs do A(x))
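A small Python sketch of a parameter sweep. The field order start:end:step with an inclusive end is inferred from the single example 0:50:10, and a positive step is assumed; `expand` and `pfor` are hypothetical names, and a thread pool stands in for whatever the real interpreter uses to run instances in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def expand(list_expr):
    """Expand 'start:end:step' into an inclusive list, e.g. '0:50:10'."""
    start, end, step = (int(part) for part in list_expr.split(":"))
    return list(range(start, end + 1, step))


def pfor(xs, a):
    """Run A(x) for every x in xs as independent parallel instances."""
    if not xs:
        return []
    with ThreadPoolExecutor(max_workers=len(xs)) as pool:
        return list(pool.map(a, xs))  # results come back in xs order


sweep = expand("0:50:10")
squares = pfor(sweep, lambda x: x * x)
```

Because every A(x) depends only on its own x, the instances can be started in any order, which is what the unfolding into repeated || expresses.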

Page 11: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

GEL: An Overview
Semantics (Parameterised seq compn)

● Parameterised sequential composition (for x in xs do A(x))
  ● xs is a list expression, x is a bound variable
  ● Executing

    for x in (a0,xs) do A(x)

    is semantically equivalent to executing

    A(a0); (for x in xs do A(x))

  ● Know number of iterations before executing loop (cf. while loop)
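The sequential counterpart can be sketched the same way in Python. As before, a directory is modelled as a dict, which is an assumption of this sketch; `for_seq`, `for_unfolded` and the logging body are hypothetical.

```python
from functools import reduce


def for_seq(xs, a, in_dir):
    """for x in xs do A(x): chain A(x0); A(x1); ... over a dict 'directory'."""
    # The list is fully known up front, so the number of steps is len(xs).
    return reduce(lambda files, x: a(x, files), xs, in_dir)


def for_unfolded(xs, a, in_dir):
    """The unfolding: A(a0); (for x in xs do A(x))."""
    if not xs:
        return in_dir
    return for_unfolded(xs[1:], a, a(xs[0], in_dir))


# Hypothetical body: each step appends its parameter to a log file.
def append_step(x, files):
    return {**files, "log.txt": files.get("log.txt", "") + f"{x};"}


direct = for_seq([0, 10, 20], append_step, {})
unfolded = for_unfolded([0, 10, 20], append_step, {})
```

Unlike the while loop, the iteration count is `len(xs)` before anything runs, so an interpreter could lay out all steps in advance.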

Page 12: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

So what did we do…?

● Grammar defined
  ● Sequential and parallel composition
  ● Sequential iteration
  ● Intrinsic jobs (e.g. file projection)

● Interpreters implemented
  ● Local machine: spawn jobs locally
  ● Clusters: spawn jobs using SGE, PBS and LSF
  ● Statically-scheduled Grid interpreter: GridFTP staging, GramJob spawn

● Required
  ● A GUI frontend

Page 13: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Wildfire…

Wildfire and GEL bring supercomputing power to the bioinformatician

Page 14: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Features

● Integrated environment
  ● Construct and execute workflows from the same interface

● User-friendly
  ● Drawing-analogy workflow construction
  ● Program options presented using Jemboss-style drop-down lists, buttons, textboxes, etc.

● Supercomputing support
  ● Shared memory multiprocessors
  ● Cluster schedulers
    ● PBS
    ● SGE
    ● LSF
  ● Grids
    ● Globus

Page 15: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

W/F Construction: Drawing

● Double click on components to change options

● Draw arrows between components

● Drag components into containers

Yellow boxes are atomic components

Parallel bars denote parallel container

Parallel “foreach” repeats contents for each file matching pattern

An arrow denotes sequential dependence

Page 16: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

W/F Construction: Components

● Wildfire has been pre-configured with EMBOSS applications

● Custom/new components can be added

Page 17: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

W/F Execution

User uses Wildfire to create workflow as GEL script

Execution on (1) Grid, (2) Cluster, or (3) local

[Figure: the GEL script and its data are run on a Grid via Globus, on a cluster via LSF, or on a laptop via fork]

Page 18: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

W/F Execution: GEL

• “Local”/SMP
  – Run programs directly
  – Use multiple processors if available

• Grid
  – Stage files using GridFTP
  – Execute programs using GRAM

• Beowulf Cluster
  – Submit job requests through queue manager
  – Use processors on compute nodes
  – Use job dependencies
    • PBS/Torque
    • Sun GridEngine (SGE)
    • Platform LSF
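As an illustration of "use job dependencies", the sketch below builds the submit command for a step that must wait for an earlier job. The dependency options shown are the standard ones of each scheduler; the slides do not say which options the GEL interpreters actually emit, and the function name, script name and job id are hypothetical.

```python
def dependent_submit(scheduler, script, after_job_id):
    """Build an argv that submits script to run after another job finishes."""
    if scheduler == "pbs":   # PBS/Torque
        return ["qsub", "-W", f"depend=afterok:{after_job_id}", script]
    if scheduler == "sge":   # Sun GridEngine
        return ["qsub", "-hold_jid", str(after_job_id), script]
    if scheduler == "lsf":   # Platform LSF
        return ["bsub", "-w", f"done({after_job_id})", script]
    raise ValueError(f"unknown scheduler: {scheduler}")


# Hypothetical job id and script name for the B step of a workflow A;B.
commands = {s: dependent_submit(s, "blast_step.sh", 4711)
            for s in ("pbs", "sge", "lsf")}
```

Expressing A;B as a dependency lets the queue manager start B as soon as A is done, without the interpreter having to poll.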

Page 19: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

More workflow parallel features

Parallel container
● Denotes independent components
● Whole container is considered a component

Parallel for loop
● Loop variable $i iterates over values 0 to 3
● For each value of $i, an instance of its contents executes in parallel

Page 20: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Workflow: while loops

Component inside round disc is the loop guard

If loop guard evaluates to false, then the break branch is taken

If loop guard is true, then the true branch is taken, after which the loop guard is evaluated again

While loop allows for iterative workflows

Page 21: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Ex: Transcript Analysis

● Transcripts database from Mammalian Gene Collection

● Exons from chromosomes from NCBI GenBank

● Blast each exon against transcript database to investigate splicing of transcripts

[Figure: pipeline diagram with nodes Chromosome, Extract, Exons, Transcripts, BLAST, Alignments]

Page 22: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Transcript Analysis: 24 Chromosomes

● Human genome has 24 chromosomes (1-22, X, Y)

● How do we leverage parallel computing?

[Figure: the pipeline diagram (Chromosome, Extract, Exons, Transcripts, BLAST, Alignments) repeated across the slide]

Page 23: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Transcript Analysis: Parallelism

Dice splits big exons file into several smaller files

Separate BLAST instances align the smaller files against transcript database

Alignments are stored in many files

One copy per chromosome

[Figure: per-chromosome pipeline with nodes C/some, Extract, Exons, Dice, Exons, three Blast instances, Results, Transcripts; the diagram is repeated for several chromosomes]

Page 24: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Transcript Analysis: Workflow

Parallel “foreach” container executes inner pipeline once for each file matching *.gbk.gz

Decompress chromosome data

Decompress transcripts file

Extract exons

Format database for BLAST query

Break up exons file into smaller files

Parallel “foreach” container executes blastall component for all files matching *_dice*.fna
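The file-pattern selection of the two "foreach" containers can be sketched with shell-style matching in Python. Only the two patterns come from the slide; the directory listings, the helper `foreach_matching` and the stand-in bodies are hypothetical, and the sketch runs the body sequentially where Wildfire would run the instances in parallel.

```python
from fnmatch import fnmatchcase


def foreach_matching(filenames, pattern, body):
    """Apply body once per filename that matches the shell-style pattern."""
    return {name: body(name) for name in sorted(filenames)
            if fnmatchcase(name, pattern)}


# Hypothetical input directory listing; only the patterns come from the slide.
listing = ["chr1.gbk.gz", "chrX.gbk.gz", "transcripts.fna.gz", "README"]
per_chromosome = foreach_matching(listing, "*.gbk.gz", lambda n: n[:-3])

diced = ["chr1_dice0.fna", "chr1_dice1.fna", "chr1.fna"]
blast_inputs = foreach_matching(diced, "*_dice*.fna", lambda n: n + ".blast")
```

The outer pattern picks one pipeline instance per chromosome file, and the inner pattern picks one BLAST instance per diced piece, so the undiced `chr1.fna` is skipped.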

Page 25: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Transcript Analysis: Execution Profile

● The execution profile shows when programs start and stop

● Note: “makespan” can be improved by balancing the duration of blast jobs (modify dice)
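One way to balance the BLAST jobs is to dice by total sequence length rather than by sequence count. The Python sketch below uses a greedy longest-first heuristic; it is not the dice program from the talk, it assumes BLAST time grows with total query length, and the exon names and lengths are invented.

```python
import heapq


def balanced_dice(lengths, n_chunks):
    """Greedy split: place each sequence, longest first, in the lightest chunk."""
    heap = [(0, i) for i in range(n_chunks)]  # (total length, chunk index)
    chunks = [[] for _ in range(n_chunks)]
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i = heapq.heappop(heap)
        chunks[i].append(name)
        heapq.heappush(heap, (total + length, i))
    return chunks


def makespan(chunks, lengths):
    """Longest chunk, using total sequence length as a proxy for BLAST time."""
    return max(sum(lengths[n] for n in chunk) for chunk in chunks)


# Hypothetical exon lengths in bases.
exon_len = {"e1": 900, "e2": 800, "e3": 100, "e4": 100, "e5": 100, "e6": 100}
by_count = [["e1", "e2", "e3"], ["e4", "e5", "e6"]]  # three sequences each
balanced = balanced_dice(exon_len, 2)
```

In this toy input, splitting by count gives chunk totals of 1800 and 300, while the greedy split gives 1100 and 1000, so the slowest job is noticeably shorter.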

Page 26: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Summary

• End-User Requirements
  – Ease of construction
  – Ease of implementation
  – Ease of recovery

• Grid Scripting the way to go?

• Interfaces to grid-scripting?

Page 27: Wildfire Distributed, Grid-Enabled Workflow Construction and Execution

Acknowledgments

• Bioinformatics Institute, Singapore

• Dr. Francis Tang

• Chua Ching Lian

• Liang-Yoong Ho