New Capabilities in the PyData Ecosystem Peter Wang Continuum Analytics @pwang


Page 1: New Capabilities in the PyData Ecosystem

New Capabilities in the PyData Ecosystem

Peter Wang Continuum Analytics @pwang

Page 2: New Capabilities in the PyData Ecosystem

Data Science @ NYT

Page 3: New Capabilities in the PyData Ecosystem

Python & SciPy

• High performance linear algebra, image processing, optimization via NumPy, optimized C++, FORTRAN

• Large structured data via HDF5, memmap

• Out of core processing, streaming & realtime

• Distributed computing via MPI, IPython Parallel, etc.

• GPU & heterogeneous via OpenCL, PyCUDA, others

• Massive adoption in research, national labs, industry (engineering, finance, etc.)

Python has a >15 year history in scientific computing

• IPython Notebook: 2005-2011

• pandas: 2008-2009

• scikit-learn: 2007

• NumPy: 2006

• matplotlib: 2002

• IPython: 2001

• Numarray: 2001

• SciPy: 1999

• Numeric: 1995

Page 4: New Capabilities in the PyData Ecosystem

"Python's Scientific Ecosystem"

@jakevdp

Page 5: New Capabilities in the PyData Ecosystem

"Many More Tools"

@jakevdp


Page 7: New Capabilities in the PyData Ecosystem

Focus On

• Bokeh

• Dask

Not Gonna Talk About...

• Blaze, odo

• dynd

• xray

• NumPy

• Pandas

• PyTables & h5py

• Beaker Notebook

• IPython widgets, JupyterHub

• conda, Anaconda Cluster

• Docker

• Docker

• Docker


Page 9: New Capabilities in the PyData Ecosystem

Bokeh

• Interactive visualization

• Novel graphics

• Streaming, dynamic, large data

• For the browser, with or without a server

• No need to write Javascript

• Support for R, Scala, Julia, Lua

http://bokeh.pydata.org

Page 10: New Capabilities in the PyData Ecosystem

Dashboards & Data Apps

Page 11: New Capabilities in the PyData Ecosystem

Static Notebooks/HTML, Interactive Plots

http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction

Page 12: New Capabilities in the PyData Ecosystem

Extensible Architecture

[Architecture diagram; labels: server.py, Browser, App Model, BokehJS object graph, bokeh-server, bokeh.py object graph, JSON]
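The diagram labels suggest a Python-side object graph that is serialized to JSON and mirrored as a BokehJS object graph in the browser. The sketch below illustrates that general idea only. It is not Bokeh's actual wire format, and `Model`, `DataSource`, `Line`, `Plot`, and `to_json` are hypothetical names invented for the illustration.

```python
import json


class Model:
    """Hypothetical base class: every object in the graph gets a unique id."""
    _next_id = 0

    def __init__(self, **props):
        Model._next_id += 1
        self.id = f"{type(self).__name__}-{Model._next_id}"
        self.props = props


class DataSource(Model):
    pass


class Line(Model):
    pass


class Plot(Model):
    pass


def to_json(root):
    """Flatten an object graph into JSON, replacing nested models by id references."""
    seen = {}

    def visit(obj):
        if isinstance(obj, Model):
            if obj.id not in seen:
                seen[obj.id] = None  # reserve the slot so shared objects are emitted once
                seen[obj.id] = {
                    "id": obj.id,
                    "type": type(obj).__name__,
                    "attributes": {k: visit(v) for k, v in obj.props.items()},
                }
            return {"ref": obj.id}
        if isinstance(obj, list):
            return [visit(item) for item in obj]
        return obj  # plain values (numbers, strings) pass through unchanged

    visit(root)
    return json.dumps({"root": root.id, "models": list(seen.values())}, indent=2)


source = DataSource(x=[1, 2, 3], y=[4, 5, 6])
plot = Plot(title="demo", renderers=[Line(source=source)])
payload = to_json(plot)
print(payload)
```

A client in any language that can parse this JSON could rebuild the same graph, which is one plausible reading of how bindings such as rBokeh fit into the picture.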

Page 13: New Capabilities in the PyData Ecosystem
Page 14: New Capabilities in the PyData Ecosystem

rBokeh

http://hafen.github.io/rbokeh

Page 15: New Capabilities in the PyData Ecosystem

Dask

Page 16: New Capabilities in the PyData Ecosystem

Example: Ocean Temp Data

• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

• Every 1/4 degree, 720x1440 array each day



Page 19: New Capabilities in the PyData Ecosystem

Bigger Data

36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed

If you don't have this much RAM...

... better start chunking.
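The size estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the trailing `x 4` means 4-byte (float32) values. The chunk length of 365 daily grids is a hypothetical choice, not something stated in the talk.

```python
import math

LAT, LON, DAYS = 720, 1440, 12341   # grid shape and number of daily grids from the slide
BYTES_PER_VALUE = 4                 # assumption: float32 values

total_bytes = LAT * LON * DAYS * BYTES_PER_VALUE
print(f"full array: {total_bytes / 1e9:.1f} GB")

CHUNK_DAYS = 365                    # hypothetical chunk length along the time axis
chunk_bytes = LAT * LON * CHUNK_DAYS * BYTES_PER_VALUE
n_chunks = math.ceil(DAYS / CHUNK_DAYS)
print(f"{n_chunks} chunks of at most {chunk_bytes / 1e9:.1f} GB each")
```

Only one chunk (plus whatever intermediate results the computation needs) has to fit in memory at a time, which is the point of chunking.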

Page 20: New Capabilities in the PyData Ecosystem

DAG of Computation

Page 31: New Capabilities in the PyData Ecosystem

Dask: Out of Core Scheduler for Python

• A parallel computing framework

• That leverages the excellent Python ecosystem

• Using blocked algorithms and task scheduling

• Written in pure Python

Core Ideas

• Dynamic task scheduling yields sane parallelism

• Simple library to enable parallelism

• Dask.array/dataframe to encapsulate the functionality

• Distributed scheduler coming
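To make "blocked algorithms and task scheduling" concrete, here is a standard-library-only sketch. It is loosely modeled on the idea of a task graph as a dictionary that maps keys to either data or `(function, arguments...)` tuples. The `chunked`, `get`, and `_resolve` helpers are hypothetical, and the evaluator runs tasks serially, whereas a real scheduler would run independent tasks in parallel.

```python
def chunked(seq, size):
    """Split a sequence into consecutive blocks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


data = list(range(10))
blocks = chunked(data, 4)

# Blocked sum: one partial sum per block, then a sum of the partial sums.
graph = {}
for i, block in enumerate(blocks):
    graph[f"block-{i}"] = block
    graph[f"partial-{i}"] = (sum, f"block-{i}")
graph["total"] = (sum, [f"partial-{i}" for i in range(len(blocks))])


def _resolve(graph, arg, cache):
    """Evaluate a task argument: lists, task tuples, graph keys, or literals."""
    if isinstance(arg, list):
        return [_resolve(graph, item, cache) for item in arg]
    if isinstance(arg, tuple) and arg and callable(arg[0]):
        func, *args = arg
        return func(*(_resolve(graph, a, cache) for a in args))
    if isinstance(arg, str) and arg in graph:
        return get(graph, arg, cache)
    return arg


def get(graph, key, cache=None):
    """Serial stand-in for a scheduler: compute `key`, memoizing intermediate results."""
    cache = {} if cache is None else cache
    if key not in cache:
        cache[key] = _resolve(graph, graph[key], cache)
    return cache[key]


result = get(graph, "total")
print(result)
```

The partial sums do not depend on each other, so a scheduler is free to run them concurrently and to drop each block from memory once its partial sum exists.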

Page 32: New Capabilities in the PyData Ecosystem

Simple Architecture

Page 33: New Capabilities in the PyData Ecosystem

Core Concepts

Page 34: New Capabilities in the PyData Ecosystem

dask.array: OOC, parallel, ND array

Arithmetic: +, *, ...

Reductions: mean, max, ...

Slicing: x[10:, 100:50:-2]

Fancy indexing: x[:, [3, 1, 2]]

Some linear algebra: tensordot, qr, svd

Parallel algorithms (approximate quantiles, topk, ...)

Slightly overlapping arrays

Integration with HDF5
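A few of the operations listed on this slide, shown on a tiny in-memory array. This is a minimal sketch that assumes `numpy` and `dask` are installed. The array contents and chunk shape are made up for illustration, and a real out-of-core workload would wrap an on-disk dataset instead of a NumPy array.

```python
import numpy as np
import dask.array as da

data = np.arange(24.0).reshape(4, 6)
x = da.from_array(data, chunks=(2, 3))   # split into blocks of shape (2, 3)

col_means = (x + 1).mean(axis=0)         # arithmetic + reduction, still lazy
sliced = x[1:, [3, 1, 2]]                # slicing + fancy indexing, still lazy

# Nothing has been computed yet; .compute() executes the task graph.
result = col_means.compute()
subset = sliced.compute()
print(result)
print(subset)
```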

Page 35: New Capabilities in the PyData Ecosystem

dask.dataframe: OOC, parallel dataframe

Elementwise operations: df.x + df.y

Row-wise selections: df[df.x > 0]

Aggregations: df.x.max()

groupby-aggregate: df.groupby(df.x).y.max()

Value counts: df.x.value_counts()

Drop duplicates: df.x.drop_duplicates()

Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Page 36: New Capabilities in the PyData Ecosystem

More Complex Graphs

cross validation

Page 37: New Capabilities in the PyData Ecosystem

http://continuum.io/blog/xray-dask

Page 38: New Capabilities in the PyData Ecosystem

PyData's Future

• Dozens of international meetup groups

• Intl conferences each year, including collab with EuroPython, Strata, and others

• More companies investing in the ecosystem

  • Dato - SFrame, SGraph, ...

  • Cloudera - Impyla, Ibis, ...

  • Microsoft - Python in AzureML

  • Databricks - PySpark

  • Continuum - *.*