Enabling Python to be a Better Big Data Citizen

© Cloudera, Inc. All rights reserved.

Enabling Python to be a Better Big Data Citizen
Wes McKinney @wesmckinn
NYC Python Meetup 2016-02-17


Me

• R&D at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}

• Mostly work in Python and Cython/C/C++


Some simplistic generalizations

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines

Python: heavy investment, generally

Python: light investment, generally


A sample big data architecture

[Diagram: Application data → Kafka → HDFS (JSON) → Spark/MapReduce → Columnar storage → Analytic SQL Engine → SQL → User]


pandas

• Hugely popular Python table / "data frame" library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python's scientific computing stack
  • User API / domain specific language
  • Bespoke in-memory analytics / relational algebra engine
  • IO interfaces (CSV, SQL, etc.)
  • Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
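As a rough illustration of the last point, the sketch below flattens a small nested record set into a flat table with `pandas.json_normalize`. The `records` data and its field names are hypothetical, made up for this example, and the call assumes a reasonably recent pandas (1.0 or later).

```python
import pandas as pd

# Hypothetical semi-structured records: each user has a nested list of visits.
records = [
    {"user": "alice", "visits": [{"page": "home", "ms": 120},
                                 {"page": "docs", "ms": 340}]},
    {"user": "bob", "visits": [{"page": "home", "ms": 95}]},
]

# One output row per nested visit; the parent's "user" field is repeated
# on each row so the result is an ordinary flat DataFrame.
flat = pd.json_normalize(records, record_path="visits", meta=["user"])

print(flat)
```

Each nested visit becomes its own row, which is the shape pandas operations such as groupby and merge expect.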


2016 Python Data Trends

• Improved Python interoperability with the Apache Hadoop ecosystem
  • I'm working with {Arrow, Kudu, Impala, Parquet, Spark}
• Support for big data file formats like Apache Parquet
• Native in-memory Python support for nested / JSON-like data


Ibis in a nutshell

• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Cross-team project @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance extensions in Python



Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)


Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …

Compute: Analytic SQL, Spark, MapReduce

Storage: HDFS, Kudu, HBase

Python, R, Julia, …?


Python interoperability challenges

• Problem 1: Serialization / deserialization overhead

[Diagram: a big data system sends input partitions 0 … n - 1 to user-supplied Python functions; each function's output is returned as output partitions 0 … n - 1 to the big data system.]


Data movement can be extremely costly

[Diagram: input partition 0 → input → Python function]

Questions
• How to represent "data in-flight" (RPC)?
• Cost of conversion between in-memory data structures and RPC representation
• How to communicate schemas / metadata?
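To make these questions concrete, here is a minimal sketch contrasting two ways of shipping the same column of integers between processes. Neither is the protocol of any particular system: the JSON payload and the small `schema` dictionary are hypothetical choices for illustration only.

```python
import json
import numpy as np

values = np.arange(5, dtype=np.int64)

# Text representation: every value is converted to text on the way out
# and parsed back into a number on the way in.
json_payload = json.dumps(values.tolist()).encode("utf-8")
from_json = np.array(json.loads(json_payload.decode("utf-8")), dtype=np.int64)

# Binary columnar representation: a tiny (hypothetical) schema describes
# the raw buffer, so the receiver can reinterpret the bytes directly.
schema = {"dtype": values.dtype.str, "length": int(values.size)}
binary_payload = values.tobytes()
from_binary = np.frombuffer(
    binary_payload, dtype=np.dtype(schema["dtype"]), count=schema["length"]
)

print(len(json_payload), len(binary_payload))
```

In the second path the receiver never parses individual values; `np.frombuffer` wraps the received bytes without copying them, and the schema is the only metadata that has to be communicated.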


Data movement can be extremely costly

[Diagram: input partition 0 → input → Python function]

Slow data movement / conversion can largely undermine the performance benefits of Python's high performance in-memory data tools


Python interoperability challenges

• Problem 2: Scalar vs vectorized computations

SCALAR

    result = np.empty(n)
    for i in range(n):
        result[i] = f(a[i], b[i])

VECTORIZED (often 100-1000x faster)

    result = f(a, b)
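A self-contained version of the two patterns above might look like the following sketch. The function `f` and the input arrays are hypothetical stand-ins; any function built from NumPy-compatible arithmetic would behave the same way.

```python
import numpy as np

def f(x, y):
    # Hypothetical function; it works on scalars and on whole arrays alike.
    return x * y + 1.0

n = 1000
a = np.arange(n, dtype=np.float64)
b = np.linspace(0.0, 1.0, n)

# SCALAR: one Python-level call to f per element.
result_scalar = np.empty(n)
for i in range(n):
    result_scalar[i] = f(a[i], b[i])

# VECTORIZED: a single call to f; the loop runs inside NumPy's compiled code.
result_vectorized = f(a, b)

print(np.allclose(result_scalar, result_vectorized))
```

Both paths compute the same values; the difference is how many times the Python interpreter is entered, which is where the speed gap comes from.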


Apache Arrow: What is it?

• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Implementing high-performance in-memory analytics (think "pandas internals")
  • Cheap data interchange amongst systems, with little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark


Columnar data

    persons = [
      {
        name: 'wes',
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'},
        ]
      },
      {
        name: 'mark',
        addresses: [
          {number: 4, street: 'ccc'},
          {number: 5, street: 'dddd'},
          {number: 6, street: 'f'},
        ]
      },
    ]


Columnar data

person.addresses
    offset: 0, 2, 5

person.addresses.street
    offset: 0, 1, 3, 6, 10
    data:   abbcccddddf

person.addresses.number
    data:   2, 3, 4, 5, 6
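The layout above can be reproduced with a few lines of plain Python. This is only a sketch of the offsets-plus-flat-data idea, not Arrow's actual implementation or API. One difference to note: the sketch appends a final end offset to every offsets list, so the street offsets end with 11, whereas the slide lists only the five start offsets.

```python
persons = [
    {"name": "wes", "addresses": [{"number": 2, "street": "a"},
                                  {"number": 3, "street": "bb"}]},
    {"name": "mark", "addresses": [{"number": 4, "street": "ccc"},
                                   {"number": 5, "street": "dddd"},
                                   {"number": 6, "street": "f"}]},
]

address_offsets = [0]   # where each person's addresses start and end
street_offsets = [0]    # where each street string starts and ends
street_data = ""        # all street characters, back to back
numbers = []            # all address numbers, back to back

for person in persons:
    for address in person["addresses"]:
        numbers.append(address["number"])
        street_data += address["street"]
        street_offsets.append(len(street_data))
    address_offsets.append(len(numbers))

def streets_of(person_index):
    # Hypothetical helper: read one person's streets back from the columns.
    start = address_offsets[person_index]
    stop = address_offsets[person_index + 1]
    return [street_data[street_offsets[j]:street_offsets[j + 1]]
            for j in range(start, stop)]

print(address_offsets, street_offsets, street_data, numbers)
print(streets_of(1))
```

Reading a nested value back needs only two offset lookups and a slice, which is why this layout suits both scans over a whole column and cheap interchange between systems.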


Apache Arrow in practice


Thank you
Wes McKinney @wesmckinn
Views are my own
