
Page 1: Dask - Going larger-than-memory and parallel with graphs

dask
Going larger-than-memory and parallel with graphs

Blake Griffith @cwlcks

Page 2:

Examples:

Page 3:

Page 4:

Graphs?

Page 5:

Page 6:

Page 7:

Directed Acyclic Graphs (DAGs)

Page 8:

[Diagram: circles (nodes) connected by lines (edges) — a “Graph”]

Page 9:

[Diagram: the same nodes and edges, now with direction — “Directed”]

Page 10:

[Diagram: directed nodes and edges with no cycles — “Acyclic”]

Page 11:

Page 12:

dask.array.arange(30, chunks=(10,)).sum()

[Task graph: three arange tasks ('arange-3', 0..2), each summed per chunk into intermediate keys ('x_53', 0..2), aggregated by a final sum(...) into ('x_54',)]
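A minimal sketch of the computation behind that graph: each chunk is summed independently and the partial sums are combined.

```python
import dask.array as da

x = da.arange(30, chunks=(10,))  # three 10-element chunks
total = x.sum()                  # builds the task graph; nothing runs yet
print(total.compute())           # 435
```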

Page 13:

Collections create graphs

Page 14:

● dask.array

● dask.dataframe

● dask.bag

● dask.imperative

Page 15:

dask.array

● Scalar math: +, *, exp, log, …

● Reductions: sum(axis=0), mean(), std(), …

● Slicing, indexing: x[:100, 500:100:-2]

● Load from HDF5 and other formats
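These operations compose lazily. A sketch (the array shape and chunking here are illustrative):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
y = (da.exp(x) + x).mean(axis=0)  # scalar math followed by a reduction
z = y[::2]                        # slicing stays lazy too
print(z.compute().shape)          # (500,)
```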

Page 16:

dask.array limitations

● Restricted to (a subset of) the NumPy API

● We always need to know the shape and dtype

● No argwhere(), nonzero(), etc.

Page 17:

dask.dataframe

● Element-wise and row-wise operations

● Shuffle operations

● Ingest data from CSVs, pandas, NumPy, …

Page 18:

dask.dataframe limitations

● The pandas API is huge

● The GIL

● Some things are hard to do in parallel, like sorting

Page 19:

dask.bag

● Parallelize functions across generic Python objects

● filter, fold, distinct, groupby
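A sketch of chaining these operations on plain Python objects (the input sequence is illustrative):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 2, 3, 4, 4, 4], npartitions=2)
evens = b.filter(lambda x: x % 2 == 0)     # lazy, applied per partition
print(sorted(evens.distinct().compute()))  # [2, 4]
```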

Page 20:

dask.bag limitations

● As slow as Python

● Avoid groupby in favor of foldby

● Multiprocessing scheduler
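The groupby-vs-foldby advice can be sketched: foldby reduces within each partition and then combines the partial results, avoiding the full shuffle groupby needs. A minimal sketch counting elements by parity:

```python
import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)

# Reduce within partitions (binop), then merge partials (combine).
counts = b.foldby(key=lambda x: x % 2,
                  binop=lambda acc, x: acc + 1,
                  initial=0,
                  combine=lambda a, c: a + c,
                  combine_initial=0).compute()
print(sorted(counts))  # [(0, 5), (1, 5)]
```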

Page 21:

dask.imperative

● do, value

● Supports most operators

● Slicing

● Attribute access

● Method calls

Page 22:

dask.imperative
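dask.imperative was later renamed dask.delayed; a sketch using the modern name (the inc/add functions are illustrative):

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calls build a graph; operators, attribute access and method
# calls on the lazy values are recorded too.
total = add(inc(1), inc(2))
print(total.compute())  # 5
```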

Page 23:

dask.imperative limitations

● Shared resources are bad

● Code must be idempotent and pure

● Iteration

● In-place operations, mutations (setitem, +=)

● Predicates: branching on a lazy value, e.g. if a: ...

Page 24:

Schedulers

Page 25:

Schedulers

● Synchronous

● Threaded

● Multiprocessing

● Distributed
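Scheduler choice is made per compute. Current dask spells this with the scheduler= keyword (at the time of this talk it was done by passing get= functions); a sketch:

```python
import dask.array as da

x = da.arange(100, chunks=10).sum()

# Same graph, different executors:
print(x.compute(scheduler="synchronous"))  # single thread, easiest to debug
print(x.compute(scheduler="threads"))      # shared-memory thread pool
```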

Page 26:

Shared Memory

● Threaded

● Multiprocessing

● Synchronous

Page 27:

Distributed Memory

● Beta

● Easy to set up with Anaconda Cluster

● Not very smart

Page 28:

Dask distributed:

● Workers

● Scheduler

● Clients

[Diagram: several clients submit work to a scheduler, which dispatches it to workers, all on the same network]
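A minimal sketch of that architecture using today's dask.distributed API (which has matured well past the beta described here); Client(processes=False) stands up an in-process scheduler and workers:

```python
from dask.distributed import Client

# Start a local scheduler and workers in this process (no cluster needed).
client = Client(processes=False)

# Clients submit tasks to the scheduler, which dispatches them to workers.
future = client.submit(sum, range(10))
result = future.result()
print(result)  # 45
client.close()
```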