continuum-analytics
dask: Going larger-than-memory and parallel with graphs
Blake Griffith @cwlcks
Examples:
Graphs?
Directed Acyclic Graphs (DAGs)
[Figures: three diagrams of nodes joined by edges, introducing the terms “Graph”, “Directed”, and “Acyclic” in turn.]
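The three terms can be made concrete with a small sketch. The node names below are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only for illustration; it is not part of dask.

```python
from graphlib import TopologicalSorter, CycleError

# Graph: nodes connected by edges.
# Directed: each edge has a direction, here "node -> the nodes it depends on".
dag = {
    "load": set(),
    "clean": {"load"},
    "stats": {"clean"},
    "plot": {"clean", "stats"},
}

# Acyclic: no path leads back to its starting node, so a valid
# execution order always exists.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Making "load" depend on "plot" closes a loop, so no order exists any more.
cyclic = dict(dag, load={"plot"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Being acyclic is what lets a scheduler decide which tasks can run now and which must wait for their inputs.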
dask.array.arange(30, chunks=(10,)).sum()
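[Figure: task graph for the expression above. Three arange tasks produce the chunks ('arange3', 0), ('arange3', 1), and ('arange3', 2); a sum over each chunk produces ('x_53', 0), ('x_53', 1), and ('x_53', 2); a final sum(...) combines them into ('x_54',).]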
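A graph of this shape can be written by hand as a plain dictionary, which is the format dask uses for task graphs: keys name results, and a task is a tuple whose first element is a callable. The sketch below is an illustration only. The key names, the `arange_chunk` helper, and the tiny recursive `get` function are assumptions made for this example; dask generates its own key names and uses its own schedulers.

```python
def arange_chunk(start, stop):
    """Hypothetical helper standing in for one chunk of arange."""
    return list(range(start, stop))

# One entry per node in the graph: three chunks, three partial sums, one total.
graph = {
    ("arange", 0): (arange_chunk, 0, 10),
    ("arange", 1): (arange_chunk, 10, 20),
    ("arange", 2): (arange_chunk, 20, 30),
    ("partial", 0): (sum, ("arange", 0)),
    ("partial", 1): (sum, ("arange", 1)),
    ("partial", 2): (sum, ("arange", 2)),
    ("total",): (sum, [("partial", 0), ("partial", 1), ("partial", 2)]),
}

def get(graph, key):
    """Evaluate one key by recursively evaluating the keys it depends on."""
    def resolve(arg):
        if isinstance(arg, list):
            return [resolve(a) for a in arg]
        if isinstance(arg, tuple) and arg in graph:
            return get(graph, arg)
        return arg  # a literal such as 0 or 10

    func, *args = graph[key]
    return func(*(resolve(a) for a in args))

total = get(graph, ("total",))
print(total)  # 435, the same as sum(range(30))
```

The three partial sums do not depend on each other, which is what a parallel scheduler can exploit.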
Collections create graphs
● dask.array
● dask.dataframe
● dask.bag
● dask.imperative
dask.array
● Scalar math: +, *, exp, log, …
● Reductions: sum(axis=0), mean(), std(), …
● Slicing, indexing: x[:100, 500:100:-2]
● Load from HDF5 and other formats
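The operations listed above might look like this in practice. This is a minimal sketch that assumes `dask` and `numpy` are installed; the in-memory array stands in for a larger on-disk source (an HDF5 dataset opened with h5py can be passed to `da.from_array` in the same way), and the array size and chunk shape are arbitrary choices.

```python
import numpy as np
import dask.array as da

# Small in-memory stand-in for a dataset that would not fit in memory.
data = np.arange(1_000_000, dtype="float64").reshape(1000, 1000)
x = da.from_array(data, chunks=(250, 250))

y = da.exp(-x / x.max()) + 1      # scalar math, still lazy
col_sums = y.sum(axis=0)          # reduction along an axis
window = x[:100, 500:100:-2]      # the slicing example from the slide

# Shapes are known before anything is computed.
print(col_sums.shape, window.shape)

# Work only happens when compute() is called.
print(window.mean().compute())
```

Nothing is evaluated until `compute()`; each line before it only extends the task graph.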
dask.array limitations
● Covers only part of the NumPy API
● We always need to know the shape and dtype
● No argwhere(), nonzero(), etc., since their output shape depends on the data
dask.dataframe
● Element-wise and row-wise operations
● Shuffle operations
● Ingest data from CSV files, pandas, NumPy, …
dask.dataframe limitations
● Pandas API is huge.
● GIL
● Some things are hard to do in parallel, like sorting.
dask.bag
● Parallelize functions across generic Python objects.
● filter, fold, distinct, groupby
dask.bag limitations
● As slow as Python
● Prefer foldby over groupby
● Multiprocessing scheduler
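The bag operations, and the advice to prefer foldby, can be sketched as follows. The word list is made up, and the sketch assumes a dask version that accepts the `scheduler=` keyword on `compute()`; the synchronous scheduler is chosen here only to keep the example simple, not because it is the bag default.

```python
from operator import add

import dask.bag as db

# Hypothetical data: any Python objects will do.
words = ["apple", "bob", "avocado", "cherry", "banana", "apple"]
b = db.from_sequence(words, npartitions=2)

# filter + distinct: unique words longer than three letters.
long_words = b.filter(lambda w: len(w) > 3).distinct().compute(scheduler="synchronous")

# foldby: count words per first letter without a full groupby.
# Arguments: key function, per-partition binop and its initial value,
# then the function that combines partition results and its initial value.
counts = b.foldby(lambda w: w[0], lambda total, w: total + 1, 0, add, 0)
counts = dict(counts.compute(scheduler="synchronous"))

print(sorted(long_words))
print(counts)
```

foldby reduces within each partition first and then merges the small per-key results, so it avoids moving every element between partitions the way a groupby does.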
dask.imperative
● do, value
● Supports most operators
● Slicing
● Attribute access
● Method calls
dask.imperative limitations
● Shared resources are bad
● Code idempotency, impurities
● Iteration
● In-place operations, mutations (setitem, +=).
● Use as a predicate: if a: ...
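The supported operations and one of the limitations can be sketched with `dask.delayed`, the name under which dask.imperative was later released. Using `delayed` in place of `do` and `value` is an assumption made so the example runs on current dask; the `inc` function is a made-up placeholder.

```python
from dask import delayed

def inc(x):
    """Hypothetical pure function to wrap."""
    return x + 1

a = delayed(inc)(1)              # lazy function call
b = delayed([1, 2, 3])           # wrap an existing value
c = a + b[1]                     # operators and indexing build more graph
d = delayed("hello").upper()     # so do method calls

print(c.compute())  # 4
print(d.compute())  # HELLO

# A lazy value has no truth value until it is computed, so using it
# directly as a predicate fails.
try:
    if a:
        pass
    predicate_ok = True
except TypeError:
    predicate_ok = False
print(predicate_ok)
```

When a branch really depends on a result, one option is to call `compute()` on the predicate first and branch on the concrete value.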
Schedulers
● Synchronous
● Threaded
● Multiprocessing
● Distributed
Shared Memory
● Threaded
● Multiprocessing
● Synchronous
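Because a collection only describes a graph, the same graph can be handed to different shared-memory schedulers. The sketch below assumes a dask version that accepts the `scheduler=` keyword on `compute()`; older releases selected the scheduler differently.

```python
import dask.array as da

# The running example: sum of 0..29 in three chunks.
total = da.arange(30, chunks=(10,)).sum()

# Single-threaded execution in the calling thread; handy for debugging.
sync_result = total.compute(scheduler="synchronous")

# Thread pool execution.
threaded_result = total.compute(scheduler="threads")

# scheduler="processes" would select the multiprocessing scheduler;
# it is left out here to keep the sketch light.
print(sync_result, threaded_result)
```

The result is the same in each case; only where and how the tasks run changes.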
Distributed Memory
● Beta
● Easy to set up with Anaconda Cluster
● Not very smart
Dask distributed:
● Workers
● Scheduler
● Clients
[Figure: one scheduler, four workers, and three clients, all on the same network.]