From LINQ to DryadLINQ


Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ

Overview

• From sequential code to parallel execution
• Dryad fundamentals
• Simple program example, plan for practicals

Distributed computation

• Single computer, shared memory
– All objects always available for read and write

• Cluster of workstations
– Each computer sees a subset of objects
– Writes on one computer must be explicitly shared

• System automatically handles complexity
– Needs some help

Data-parallel computation

• LINQ is a high-level declarative specification
• Same action on entire collection of objects
• set.Select(x => f(x))
– Compute f(x) on each x in set, independently

• set.GroupBy(x => key(x))
– Group by unique keys, independently

• set.OrderBy(x => key(x))
– Sort whole set (system chooses how)
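These three operators can be sketched in plain Python on a single machine. This is a hedged illustration of the semantics only; the names select, group_by, and order_by are hypothetical stand-ins for the LINQ methods, not DryadLINQ API.

```python
# Hedged single-machine sketch of the three LINQ operators above.

def select(source, f):
    # set.Select(x => f(x)): apply f to each element independently
    return [f(x) for x in source]

def group_by(source, key):
    # set.GroupBy(x => key(x)): one group per distinct key
    groups = {}
    for x in source:
        groups.setdefault(key(x), []).append(x)
    return groups

def order_by(source, key):
    # set.OrderBy(x => key(x)): sort the whole collection by key
    return sorted(source, key=key)
```

Each operator is declarative: it names the result, not the execution plan, which is what lets the system choose a parallel implementation.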

Distributed cluster computing

• Dataset is stored on local disks of cluster

[Figure: dataset set divided into partitions set.0 … set.7, one partition stored on the local disk of each cluster machine]

Simple distributed computation

var set2 = set.Select(x => f(x))

[Figure: partitions set.0 … set.7 are each transformed independently by f, producing output partitions set2.0 … set2.7; no data moves between machines]

Directed acyclic graph

• Computation reads and writes along edges
• Graph shows parallelism via independence
• Goals of DryadLINQ optimizer
– Extract parallelism (find independent work)
– Control data skew (balance work across nodes)
– Limit cross-computer data transfer

Distributed grouping

var groups = set.GroupBy(x => x.key)

• set is a collection of records, each with a key
• Don’t know what keys are present
– Or in which partitions

• First, reorganize data
– All records with same key on same computer

• Then can do final grouping in parallel

Distributed grouping

var groups = set.GroupBy(x => x.key)

set → hash partition by key → group locally → groups

[Figure: records with keys a, b, c, d start scattered across four input partitions; hash partitioning routes all records with the same key to the same machine, which then groups them locally, e.g. one machine ends up with the a and c groups, another with the b and d groups]
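The two-phase plan above can be sketched in Python, with lists standing in for machines. This is an illustration under that assumption; the helper names are hypothetical, and a real system routes records over the network rather than between lists.

```python
# Hedged sketch of distributed GroupBy: hash-partition so that all records
# sharing a key land on the same "machine" (a list), then group locally.

def hash_partition(records, key, num_parts):
    # Phase 1: route each record to partition hash(key) % num_parts
    parts = [[] for _ in range(num_parts)]
    for r in records:
        parts[hash(key(r)) % num_parts].append(r)
    return parts

def group_locally(partition, key):
    # Phase 2: within one partition, build the final groups
    groups = {}
    for r in partition:
        groups.setdefault(key(r), []).append(r)
    return groups

def distributed_group_by(records, key, num_parts=2):
    parts = hash_partition(records, key, num_parts)
    merged = {}
    for p in parts:
        merged.update(group_locally(p, key))  # keys never span partitions
    return merged
```

Because equal keys always hash to the same partition, the local groupings are complete and can run fully in parallel.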

Distributed sorting

var sorted = set.OrderBy(x => x.key)

set → sample → compute histogram → range partition by key → sort locally → sorted

[Figure: input partitions hold keys 100, 1, 1, 1, 2, 3, 4, 1; sampling the keys yields ranges [1,1] and [2,100]; range partitioning sends each key to the machine owning its range, each machine sorts locally, and the concatenated output is 1 1 1 1 2 3 4 100]
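The sample-then-range-partition plan can be sketched in Python, again with lists standing in for machines. This is a hedged sketch: the function names are hypothetical, and the boundary-picking here is a crude quantile estimate rather than the system's actual histogram computation.

```python
import random

# Hedged sketch of distributed OrderBy: sample keys to pick range
# boundaries, range-partition the records, then sort each range locally.

def pick_boundaries(records, key, num_parts, sample_size=100):
    # Sample keys and take evenly spaced quantiles as range boundaries
    sample = sorted(key(r) for r in
                    random.sample(records, min(sample_size, len(records))))
    step = max(1, len(sample) // num_parts)
    return [sample[i] for i in range(step, len(sample), step)][:num_parts - 1]

def range_partition(records, key, boundaries):
    # Route each record to the range its key falls in
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        i = sum(1 for b in boundaries if key(r) > b)  # boundaries the key exceeds
        parts[i].append(r)
    return parts

def distributed_order_by(records, key, num_parts=2):
    boundaries = pick_boundaries(records, key, num_parts)
    result = []
    for part in range_partition(records, key, boundaries):
        # Ranges are disjoint and ordered, so local sorts concatenate
        # into a total order with no merge step.
        result.extend(sorted(part, key=key))
    return result
```

Sampling is what controls data skew: if one range would receive far more records than the others, the histogram shifts the boundaries to rebalance the work.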

Additional optimizations

var histogram = set.GroupBy(x => x.key)
                   .Select(g => new { g.Key, Count = g.Count() })

set → hash partition by key → group locally → count → histogram

[Figure: the naive plan ships every record across the network before grouping; after local grouping each machine counts its groups, e.g. all the a records gather on one machine and the b and d records on another]

var histogram = set.GroupBy(x => x.key)
                   .Select(g => new { g.Key, Count = g.Count() })

set → group locally → count → hash partition by key → group locally → combine counts → histogram

[Figure: each input partition first groups and counts its own records, producing small partial counts such as a,2 b,2; only these pairs cross the network, where they are hash-partitioned by key and summed to give the final histogram a,6 b,6 d,4]
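The optimized plan above, with local pre-aggregation before the network shuffle, can be sketched in Python using Counter for the per-key counts. A hedged sketch under the same list-as-machine assumption; the function names are hypothetical.

```python
from collections import Counter

# Hedged sketch of the optimized histogram: count locally on each input
# partition, ship only the small (key, count) pairs, then combine.

def local_count(partition):
    # Pre-aggregation on each machine: one (key, count) pair per local key
    return Counter(partition)

def combine_counts(partials, num_parts=2):
    # Route each partial count by hash of its key, then sum per key
    buckets = [Counter() for _ in range(num_parts)]
    for partial in partials:
        for key, n in partial.items():
            buckets[hash(key) % num_parts][key] += n
    merged = Counter()
    for b in buckets:
        merged.update(b)  # keys never span buckets, so this is a disjoint union
    return merged

def distributed_histogram(partitions):
    return combine_counts([local_count(p) for p in partitions])
```

The rewrite is valid because counting is associative: partial counts can be summed in any grouping, so the expensive repartitioning moves a few pairs per key instead of every record.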

What Dryad does

• Abstracts cluster resources
– Set of computers, network topology, etc.

• Schedules the DAG: chooses cluster computers
– Fairly among competing jobs
– So computation is close to data

• Recovers from transient failures
– Reruns computations on machine or network fault
– Runs speculative duplicates of slow computations

Resources are virtualized

• Each graph node is a process
– Writes outputs to disk
– Reads inputs from upstream nodes’ output files

• Graph is generally larger than the cluster
– e.g. 1 TB input, 250 MB partitions, 4,000 parts

• Cluster is shared
– Don’t size the program for the exact cluster
– Use whatever share of resources is available

What controls parallelism

• Initially based on partitioning of inputs

• After reorganization, system or user decides

DryadLINQ-specific operators

• set = PartitionedTable.Get<T>(uri)
• set.ToPartitionedTable(uri)
• set.HashPartition(x => f(x), numberOfParts)
• set.AssumeHashPartition(x => f(x))
• [Associative] f(x) { … }
• RangePartition(…), Apply(…), Fork(…)
• [Decomposable], [Homomorphic], [Resource]
• Field mappings, multiple partitioned tables, …

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // Bind to the partitioned input and count its records on the cluster
            PartitionedTable<LineRecord> table =
                PartitionedTable.Get<LineRecord>(inputUri);
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}

Form into groups

• 9 groups, one MSRI member per group
• Try to pick common interest for project later

Machines: sherwood-246 — sherwood-253, sherwood-255

Samples: d:\dryad\data\Workshop\DryadLINQ\samples (Count, Points, Robots)

Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe

TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe
