From LINQ to DryadLINQ


Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Overview

• From sequential code to parallel execution
• Dryad fundamentals
• Simple program example, plan for practicals

Distributed computation

• Single computer, shared memory
– All objects always available for read and write

• Cluster of workstations
– Each computer sees a subset of objects
– Writes on one computer must be explicitly shared

• System automatically handles complexity
– Needs some help

Data-parallel computation

• LINQ is a high-level declarative specification
• Same action on entire collection of objects
• set.Select(x => f(x))
– Compute f(x) on each x in set, independently

• set.GroupBy(x => key(x))
– Group by unique keys, independently

• set.OrderBy(x => key(x))
– Sort whole set (system chooses how)
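These three operators can be sketched in plain Python on a single machine. This is a hedged illustration of the semantics only; the names select, group_by, and order_by are hypothetical stand-ins for the LINQ methods, not DryadLINQ API.

```python
# Hedged single-machine sketch of the three LINQ operators above.

def select(source, f):
    # set.Select(x => f(x)): apply f to each element independently
    return [f(x) for x in source]

def group_by(source, key):
    # set.GroupBy(x => key(x)): one group per distinct key
    groups = {}
    for x in source:
        groups.setdefault(key(x), []).append(x)
    return groups

def order_by(source, key):
    # set.OrderBy(x => key(x)): sort the whole collection by key
    return sorted(source, key=key)
```

Each operator is declarative: it names the result, not the execution plan, which is what lets the system choose a parallel implementation.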

Distributed cluster computing

• Dataset is stored on local disks of cluster

[Figure: dataset set divided into partitions set.0 … set.7, one partition stored on the local disk of each cluster machine]

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: partitions set.0 … set.7 are each transformed independently by f, producing output partitions set2.0 … set2.7; no data moves between machines]

Directed acyclic graph

• Computation reads and writes along edges
• Graph shows parallelism via independence
• Goals of DryadLINQ optimizer
– Extract parallelism (find independent work)
– Control data skew (balance work across nodes)
– Limit cross-computer data transfer

Distributed grouping

var groups = set.GroupBy(x => x.key)

• set is a collection of records, each with a key
• Don’t know what keys are present
– Or in which partitions

• First, reorganize data
– All records with same key on same computer

• Then can do final grouping in parallel

Distributed grouping

var groups = set.GroupBy(x => x.key)

set → hash partition by key → group locally → groups

[Figure: records with keys a, b, c, d start scattered across four input partitions; hash partitioning routes all records with the same key to the same machine, which then groups them locally, e.g. one machine ends up with the a and c groups, another with the b and d groups]
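The two-phase plan above can be sketched in Python, with lists standing in for machines. This is an illustration under that assumption; the helper names are hypothetical, and a real system routes records over the network rather than between lists.

```python
# Hedged sketch of distributed GroupBy: hash-partition so that all records
# sharing a key land on the same "machine" (a list), then group locally.

def hash_partition(records, key, num_parts):
    # Phase 1: route each record to partition hash(key) % num_parts
    parts = [[] for _ in range(num_parts)]
    for r in records:
        parts[hash(key(r)) % num_parts].append(r)
    return parts

def group_locally(partition, key):
    # Phase 2: within one partition, build the final groups
    groups = {}
    for r in partition:
        groups.setdefault(key(r), []).append(r)
    return groups

def distributed_group_by(records, key, num_parts=2):
    parts = hash_partition(records, key, num_parts)
    merged = {}
    for p in parts:
        merged.update(group_locally(p, key))  # keys never span partitions
    return merged
```

Because equal keys always hash to the same partition, the local groupings are complete and can run fully in parallel.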

Distributed sorting

var sorted = set.OrderBy(x => x.key)

set → sample → compute histogram → range partition by key → sort locally → sorted

[Figure: input partitions hold keys 100, 1, 1, 1, 2, 3, 4, 1; sampling the keys yields ranges [1,1] and [2,100]; range partitioning sends each key to the machine owning its range, each machine sorts locally, and the concatenated output is 1 1 1 1 2 3 4 100]
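The sample-then-range-partition plan can be sketched in Python, again with lists standing in for machines. This is a hedged sketch: the function names are hypothetical, and the boundary-picking here is a crude quantile estimate rather than the system's actual histogram computation.

```python
import random

# Hedged sketch of distributed OrderBy: sample keys to pick range
# boundaries, range-partition the records, then sort each range locally.

def pick_boundaries(records, key, num_parts, sample_size=100):
    # Sample keys and take evenly spaced quantiles as range boundaries
    sample = sorted(key(r) for r in
                    random.sample(records, min(sample_size, len(records))))
    step = max(1, len(sample) // num_parts)
    return [sample[i] for i in range(step, len(sample), step)][:num_parts - 1]

def range_partition(records, key, boundaries):
    # Route each record to the range its key falls in
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        i = sum(1 for b in boundaries if key(r) > b)  # boundaries the key exceeds
        parts[i].append(r)
    return parts

def distributed_order_by(records, key, num_parts=2):
    boundaries = pick_boundaries(records, key, num_parts)
    result = []
    for part in range_partition(records, key, boundaries):
        # Ranges are disjoint and ordered, so local sorts concatenate
        # into a total order with no merge step.
        result.extend(sorted(part, key=key))
    return result
```

Sampling is what controls data skew: if one range would receive far more records than the others, the histogram shifts the boundaries to rebalance the work.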

Additional optimizations

var histogram = set.GroupBy(x => x.key)
                   .Select(g => new { g.Key, Count = g.Count() })

set → hash partition by key → group locally → count → histogram

[Figure: the naive plan ships every record across the network before grouping; after local grouping each machine counts its groups, e.g. all the a records gather on one machine and the b and d records on another]

var histogram = set.GroupBy(x => x.key)
                   .Select(g => new { g.Key, Count = g.Count() })

set → group locally → count → hash partition by key → group locally → combine counts → histogram

[Figure: each input partition first groups and counts its own records, producing small partial counts such as a,2 b,2; only these pairs cross the network, where they are hash-partitioned by key and summed to give the final histogram a,6 b,6 d,4]
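The optimized plan above, with local pre-aggregation before the network shuffle, can be sketched in Python using Counter for the per-key counts. A hedged sketch under the same list-as-machine assumption; the function names are hypothetical.

```python
from collections import Counter

# Hedged sketch of the optimized histogram: count locally on each input
# partition, ship only the small (key, count) pairs, then combine.

def local_count(partition):
    # Pre-aggregation on each machine: one (key, count) pair per local key
    return Counter(partition)

def combine_counts(partials, num_parts=2):
    # Route each partial count by hash of its key, then sum per key
    buckets = [Counter() for _ in range(num_parts)]
    for partial in partials:
        for key, n in partial.items():
            buckets[hash(key) % num_parts][key] += n
    merged = Counter()
    for b in buckets:
        merged.update(b)  # keys never span buckets, so this is a disjoint union
    return merged

def distributed_histogram(partitions):
    return combine_counts([local_count(p) for p in partitions])
```

The rewrite is valid because counting is associative: partial counts can be summed in any grouping, so the expensive repartitioning moves a few pairs per key instead of every record.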

What Dryad does

• Abstracts cluster resources
– Set of computers, network topology, etc.

• Schedules the DAG: chooses cluster computers
– Fairly among competing jobs
– So computation is close to data

• Recovers from transient failures
– Reruns computations on machine or network fault
– Runs speculative duplicates of slow computations

Resources are virtualized

• Each graph node is a process
– Writes outputs to disk
– Reads inputs from upstream nodes’ output files

• Graph is generally larger than the cluster
– e.g. 1 TB input, 250 MB partitions, 4,000 parts

• Cluster is shared
– Don’t size the program for the exact cluster
– Use whatever share of resources is available

What controls parallelism

• Initially based on partitioning of inputs

• After reorganization, system or user decides

DryadLINQ-specific operators

• set = PartitionedTable.Get<T>(uri)
• set.ToPartitionedTable(uri)
• set.HashPartition(x => f(x), numberOfParts)
• set.AssumeHashPartition(x => f(x))
• [Associative] f(x) { … }
• RangePartition(…), Apply(…), Fork(…)
• [Decomposable], [Homomorphic], [Resource]
• Field mappings, multiple partitioned tables, …

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // Bind to the partitioned input and count its records on the cluster
            PartitionedTable<LineRecord> table =
                PartitionedTable.Get<LineRecord>(inputUri);
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}

Form into groups

• 9 groups, one MSRI member per group
• Try to pick common interest for project later

Machines: sherwood-246 — sherwood-253, sherwood-255

Samples: d:\dryad\data\Workshop\DryadLINQ\samples (Count, Points, Robots)

Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe

TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe
