68
Wide-SIMD Parallelization of Streaming Dataflow, with Applications to Bioinformatics Jeremy Buhler For CSE 591 NSF Awards CNS-1500173, CNS-1763503

Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Wide-SIMD Parallelization of Streaming Dataflow, with

Applications to Bioinformatics

Jeremy Buhler For CSE 591

NSF Awards CNS-1500173, CNS-1763503

Page 2: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Take-Home Message

• Biological sequence analysis is a source of high-impact computational problems

• Using SIMD parallel computing for these problems requires dealing with irregularity

• MERCATOR is an ongoing research effort to make irregular application development on SIMD platforms easier.

2

Page 3: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Who Am I? • I study how to accelerate high-

impact bioinformatics problems.

• One way to do this is via parallelization on modern architectures (FPGAs, GPUs, …)

• Along the way, many interesting CS questions… – Streaming computation [FCCM’07, JVSP’07,M&M’09] – Systolic array design [FPL’09,FCCM’10,ASAP’10] – Deadlock avoidance [SPAA’10,PPoPP’12,DFM’13,JPDC’17] – SIMD mapping [ISPDC’14,DFM’15,HPCS’17]

3

Page 4: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Talk Overview

• Problems: DNA comparison and read mapping

• Algorithmic approach – Why SIMD?

• MERCATOR overview and performance

• Research challenges

4

Page 5: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Molecular Biology is Fundamental

• Genetic basis of disease and disease risk

• Systems biology – what are your cells doing?

• Studying natural history and evolution

• Engineering cells’ behavior for medicine, industry, agriculture

5

Page 6: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

The First Step: DNA Sequencing

• Sequencing can tell us what is in a genome…

• … but also the basis of experiments to probe gene expression, protein binding, chromosome

conformation, epigenetic marks, polymorphism, copy number variants …

…acaggatagtaccgataccat cacccggataggacctatgag ggacacaggacttatggcattt… 6

Page 7: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

7

Page 8: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

Billi

ons o

f DN

A Ba

ses

NIH Sequence Read Archive Size (Open Access Only)

http://www.ncbi.nlm.nih.gov/Traces/sra/

8

Page 9: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Problem: Classical Similarity Search

• Given – a genome-sized or larger DNA

sequence database D – a “query” sequence q of some

length L << |D|

• Does q appear in D with at most k differences, and if so, where?

9

Page 10: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Typical Parameters

• Database D has size 109 – 1010 bases

• Query q has size 102 – 104 bases

• # differences k is 5-25% of |q| (bases added, deleted, changed)

10

Page 11: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Tools for Similarity Search

• BLAST [Altschul et al. 1990, 1996]

• BLAT [Kent 2002]

These tools use variants of same basic search algorithm.

11

Page 12: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Problem: Short-Read Mapping

• Given – a genome-sized or larger DNA

sequence database D – N “reads” – DNA seqs of some

length L << |D|

• For each read, does it appear in D with at most k differences, and if so, where?

12

Page 13: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Typical Parameters

• Database D has size 109 – 1010 bases

• Number of reads N is 106 – 108

• Length L is 75-150 (may vary among reads)

• # differences k is 0-3 (added, deleted, changed)

13

Page 14: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Tools for Mapping

• Bowtie [Langmead et al. 2009, 2012]

• BWA [Li & Durbin 2009, 2010]

• SOAP2 [Li et al. 2009]

All these tools use variants of same basic search algorithm.

14

Page 15: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Why Short-Read Mapping?

• Some experimental procedure selects a subset of everything in the database

• Reads are sampled from this subset by your sequencing machine

• Mapping tells you which parts of database are present in your sample

15

Page 16: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Problem: Alignment-Free Organism ID

• Given – a metagenome-sized DNA

sequence database D – N microbial genomes – DNA seqs

of some length L << |D|

• For each genome, do (some of) its sequences appear in D?

16

Page 17: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Alignment-Free Techniques

• Min-Hash Sketching – convert a seq to a small sample (m ~ 1000) of hash values

• Approximate Containment: how much of (the sketch for) a genome overlaps (the sketch for) a metagenome?

• MASH (Ondov et al. 2016) • SourMASH (Brown et al. 2016)

17

Page 18: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Talk Overview

• Problems: DNA alignment and read mapping

• Algorithmic approach – Why SIMD?

• MERCATOR overview and performance

• Research challenges

18

Page 19: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

How BLAST Works

19

Substring

Matching

Gapped

Filter Ungapped

Filter

SEQUENCES REMAINING

COMPUTATIONAL COST

BLAST operates as a pipeline of computational stages.

Page 20: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Stages of BLAST

• Stage 1: identify potential match locations between q, D

• Stage 2: keep only those locations that look somewhat promising

• Stage 3: keep only those locations that actually yield high-similarity alignments

20

Page 21: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Generating Possible Matches

• Every place where some 11-mer from q matches an 11-mer from D exactly is a candidate.

• Can rapidly find all such matching locations using hash table of 11-mers in sequence q

21

accagatacatagcactcgctacgtcagatgggtaca gttaagtcagatgggtagactcaggatgacagtggaca

Page 22: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Filtering Candidates

• Uses explicit edit distance computation between q, part of D (Smith-Waterman algo)

• Expensive dynamic programming!

• “Easy” version (substitutions only), followed by hard version (add/delete chars allowed)

22

Page 23: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

BLAST Parallelization

• Can generate candidates in parallel at each DB location, then filter them in parallel.

23

Gen Candidates

Filter Ungapped

Filter Gapped

Page 24: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

What About Read Mapping?

• Uses an index (virtual suffix tree) of database

• Matching involves tracing a path down index tree for each read

• (must try several paths if differences are allowed)

• Can do in parallel for many reads at once!

24

Page 25: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Suffix Tree Example

D = acagaccaga$ 0 1 2 3 4 5 6 7 8 9 10

10 $

9 a$

0 aca…

4 acc…

7 aga$

2 agac…

6 caga$

5 cca…

1 cagac…

8 ga$

3 gac…

A

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

T

25

Page 26: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Rapid Matching vs Suffix Tree

• Can find all matches to a read in D in time proportional to read length L.

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

Where is cag?

26

Page 27: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Rapid Matching vs Suffix Tree

• Can find all matches to a read in D in time proportional to read length L.

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

Where is cag?

27

Page 28: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Rapid Matching vs Suffix Tree

• Can find all matches to a read in D in time proportional to read length L.

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

Where is cag?

28

Page 29: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Rapid Matching vs Suffix Tree

• Can find all matches to a read in D in time proportional to read length L.

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

Where is cag?

29

Page 30: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Rapid Matching vs Suffix Tree

• Can find all matches to a read in D in time proportional to read length L.

9 0 4 7 2 6 1 5 8 3

a c g a

$ c g a

a g a

c

$ c

$ c

a c $ c

… …

Where is cag?

D = acagaccaga$ 0 1 2 3 4 5 6 7 8 9 10

30

Page 31: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Extension to Inexact Matching

• To permit matches with k substitutions, try multiple paths, but charge for each mismatch.

• To permit matches with k differences, we do dynamic programming to compute edit distance of read against each path in tree.

• Descent stops for a read when we hit bottom of tree or find that path requires > k differences.

31

Page 32: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Parallel Alignment is a SIMD Computation

• We process every BLAST starting loc / every read through same filtering computation

• Single Instruction stream, Multiple Data items

Thread 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Page 33: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

SIMD Targets

• Our work: NVIDIA GPUs (32 SIMD lanes x 4+ threads x 2-64 cores)

• Other possibilities: any multicore with wide vector instructions (Intel Xeon, AMD, ARM, …)

• ~All modern processors have wide SIMD!

33

Page 34: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal (Short Reads)

34 9 0 4 7 2 6 1 5 8 3

Page 35: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

35 9 0 4 7 2 6 1 5 8 3

Page 36: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

36 9 0 4 7 2 6 1 5 8 3

x x x

Some reads may accumulate > k diffs before others

Page 37: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

37 9 0 4 7 2 6 1 5 8 3

x x x x x

Page 38: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

38 9 0 4 7 2 6 1 5 8 3

x x x x x x

Stop descending when all reads either have > k diffs or are completely matched with fewer diffs

Page 39: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

39 9 0 4 7 2 6 1 5 8 3

x x x x x x

Continue on next branch starting from batch on top of stack

Page 40: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

40 9 0 4 7 2 6 1 5 8 3

x x x x x x

x x x x x x x x

Page 41: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

41 9 0 4 7 2 6 1 5 8 3

x x x x x x

Page 42: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Batched Traversal

42 9 0 4 7 2 6 1 5 8 3

x x x x

Page 43: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Performance?

• Each stage of BLAST costs more but processes less input.

• 98% of threads idle for 110/111 ms • 1.99% of threads idle for 100/111 ms • SIMD EFFICIENCY: 1.1%

43

Substring

Matching

Gapped

Filter Ungapped

Filter

1 ms 10 ms 100 ms

100% 2% 0.01%

Page 44: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Irregular Computations

• DNA alignment is an irregular computation: different inputs (i.e. DB locations, reads) require different amounts of work to process.

• Antithesis of, e.g., linear algebra calculations that are easily vectorized

• Irregular computations are highly inefficient if naively implemented on SIMD processors.

44

Page 45: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

The Key Problem

• How can we efficiently map irregular computations onto a SIMD architecture?

45

Page 46: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Talk Overview

• Problems: DNA alignment and read mapping

• Algorithmic approach – Why SIMD?

• MERCATOR overview and performance

• Research challenges

46

Page 47: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

47

Pause for MERCATOR demo

Page 48: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

MERCATOR Paradigm

• Application processes a long stream of inputs

• Application graph consists of nodes (computations), edges (data transfer)

• Data flows through graph of computations

• Irregularity: paths differ per input, each input to a node generates 0, 1, or multiple outputs 48

Page 49: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Handling Irregularity

• Each edge between nodes has a queue

• MERCATOR queues inputs to a node until there are enough to fill all its SIMD lanes

• Node is only fired when it has “full ensemble” of inputs in all lanes.

49

Page 50: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

50

Page 51: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

51

x

x

Page 52: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

52

Page 53: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

53

Page 54: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

54

x

x x

Page 55: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

55

Page 56: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

56

Page 57: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

57

x

x

Page 58: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

58

Page 59: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Illustration of Queues

59

Page 60: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

A Few Complications

• Shared Code – two or more nodes may do same thing (e.g. Viola-Jones)

• Overhead – queueing isn’t free

• Asynchrony – must use multiple processors, each with multiple SIMD lanes

• Ordering – are inputs processed “in order”?

60

Page 61: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Exploiting Shared Code

• “Module type” CUDA code

• Multiple nodes with same function have same module type

• We execute all nodes of a given module type in parallel!

• [Requires pulling data from each node’s queue concurrently]

61

Page 62: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Minimizing Overhead

• Queue manipulation is itself parallelized

• Easy case: “read next k inputs from queue into threads 1..k.”

• More fun: “read k total inputs from all queues combined into threads 1..k, and remember which queue each input came from.”

62

Page 63: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Sneaky Tricks

• Parallel scan

• Branch-free binary search

• Parallel output compaction

• [exploits, maintains input ordering] 63

Page 64: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Results of Synthetic Trial

64

Page 65: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Dealing with Asynchrony

• Shared input / output buffers

• Output order with multiple processors?

• [Need stream-synchronized signaling]

• Associative (and commutative?) reductions

65

Page 66: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Applications with Cycles

• App graph can have back edges

• Issue: deadlock prevention

• [topology restrictions, queueing policy]

• Order preservation?

66

Page 67: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Optimization Opportunities

• Parameter tuning (queue sizes, scheduler, …)

• Latency-sensitive applications vs occupancy

• Fusing nodes to elide queueing (at what cost to occupancy?)

67

Page 68: Wide-SIMD Parallelization of Streaming Dataflow, with ...jain/cse591-18/ftp/buhler591.pdf · Application graph consists of nodes (computations), edges (data transfer) •Data flows

Want to Play?

• https://github.com/jdbuhler/mercator

• MERCATOR will be a testing ground for SIMD-aware irregular streaming computation

• Many interesting problems still to be solved!

• thesis topics

68