1 © Cloudera, Inc. All rights reserved.
Enabling Python to be a Better Big Data Citizen
Wes McKinney @wesmckinn
NYC Python Meetup 2016-02-17
Me
• R&D at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
Some simplistic generalizations
Industry Analytics
  Heterogeneous data
  Flat tables and JSON
  Spark / MapReduce
  SQL
  DFS-friendly / streaming data formats
  More physical machines
Scientific Computing
  Homogeneous data
  Multidimensional arrays
  HPC tools
  Linear algebra
  Scientific data formats (e.g. HDF5)
  Fewer physical machines
Python: heavy investment, generally
Python: light investment, generally
A sample big data architecture
[Diagram. Labels: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]
pandas
• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
2016 Python Data Trends
• Improved Python interoperability with the Apache Hadoop ecosystem
  • I’m working with {Arrow, Kudu, Impala, Parquet, Spark}
• Support for big data file formats like Apache Parquet
• Native in-memory Python support for nested / JSON-like data
Ibis in a nutshell
• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Cross-team project @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance extensions in Python
Enabling interoperability with big data systems
• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)
Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
Python interoperability challenges
• Problem 1: Serialization / deserialization overhead
[Diagram: input partitions 0 … n - 1 in the big data system are each passed as input to a Python function (user-supplied Python code); each function's output becomes output partition 0 … n - 1 back in the big data system]
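The round trip in the diagram can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not how any particular big data system works: `run_python_udf` and the partition layout are hypothetical, and `pickle` merely stands in for whatever wire format the host system uses.

```python
import pickle


def run_python_udf(serialized_partition, func):
    """Hypothetical worker step: deserialize, apply user code, reserialize."""
    rows = pickle.loads(serialized_partition)   # deserialization cost
    out = [func(row) for row in rows]           # the user's actual work
    return pickle.dumps(out)                    # serialization cost


# Hypothetical input: three partitions of integers held by the host system.
partitions = [list(range(i * 4, (i + 1) * 4)) for i in range(3)]

# The host system serializes each partition before handing it to Python.
wire_in = [pickle.dumps(p) for p in partitions]
wire_out = [run_python_udf(buf, lambda x: x * 2) for buf in wire_in]

# The host system deserializes the results into output partitions.
out_partitions = [pickle.loads(buf) for buf in wire_out]
```

Each partition is encoded twice and decoded twice, even though the user function only multiplies by two.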
Data movement can be extremely costly
[Diagram: input partition 0 passed as input to a Python function]
Questions
• How to represent “data in-flight” (RPC)?
• Cost of conversion between in-memory data structures and RPC representation
• How to communicate schemas / metadata?
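One way to make these questions concrete is a message that carries a small schema header followed by raw column buffers. The sketch below is a toy under stated assumptions, not a real protocol: the names `encode_batch` and `decode_batch`, the JSON header, and the length-prefixed layout are all invented for illustration.

```python
import json
import struct
from array import array


def encode_batch(columns):
    """Pack {name: array} as: 4-byte header length, JSON schema, raw buffers."""
    schema = [{"name": name, "typecode": col.typecode, "length": len(col)}
              for name, col in columns.items()]
    header = json.dumps(schema).encode("utf-8")
    # Buffers use native byte order here; a real protocol would pin this down.
    body = b"".join(col.tobytes() for col in columns.values())
    return struct.pack("<I", len(header)) + header + body


def decode_batch(message):
    """Rebuild the columns using only the schema carried in the message."""
    (header_len,) = struct.unpack_from("<I", message, 0)
    schema = json.loads(message[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    columns = {}
    for field in schema:
        col = array(field["typecode"])
        nbytes = field["length"] * col.itemsize
        col.frombytes(message[offset:offset + nbytes])
        columns[field["name"]] = col
        offset += nbytes
    return columns


# Hypothetical two-column batch.
batch = {"number": array("q", [2, 3, 4, 5, 6]),
         "score": array("d", [0.5, 1.5, 2.5, 3.5, 4.5])}
decoded = decode_batch(encode_batch(batch))
```

The receiver needs no prior knowledge of the columns: names, types, and lengths travel with the data, and the values themselves are copied as bytes rather than converted one element at a time.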
Data movement can be extremely costly
Slow data movement / conversion can largely undermine the performance benefits of Python’s high performance in-memory data tools
Python interoperability challenges
• Problem 2: Scalar vs vectorized computations

SCALAR
    result = np.empty(n)
    for i in range(n):
        result[i] = f(a[i], b[i])

VECTORIZED (often 100-1000x faster)
    result = f(a, b)
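A runnable version of the two patterns, assuming NumPy is installed. The functions `f_scalar` and `f_vectorized` are hypothetical stand-ins for `f`; the arithmetic was chosen only so that both forms give identical results.

```python
import numpy as np


def f_scalar(x, y):
    """Works on one pair of values at a time."""
    return x * y + 1.0


def f_vectorized(x, y):
    """Same arithmetic, applied to whole arrays inside NumPy's compiled loops."""
    return x * y + 1.0


n = 1000
a = np.arange(n, dtype=np.float64)
b = np.full(n, 2.0)

# SCALAR: one interpreted function call per element.
result_scalar = np.empty(n)
for i in range(n):
    result_scalar[i] = f_scalar(a[i], b[i])

# VECTORIZED: a single call over the full arrays.
result_vectorized = f_vectorized(a, b)
```

Both forms compute the same values; the difference is how many times control passes through the Python interpreter. Timing the two with `timeit` shows the gap on a given machine, which will vary with `n` and the function body.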
Apache Arrow: What is it?
• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
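The idea of interchange with "little or no serialization" can be illustrated with Python's own buffer protocol. This is not Arrow's API. It is only an analogy, assuming that a "producer" and a "consumer" (both hypothetical) agree on the memory layout of a column, so the consumer can read the producer's bytes directly.

```python
from array import array

# A contiguous column of 64-bit integers owned by a hypothetical producer.
column = array("q", [10, 20, 30, 40])

# A hypothetical consumer receives a view of the same memory, not a copy.
view = memoryview(column)

# Slicing a memoryview does not copy either.
tail = view[2:]

# A write by the producer is visible through the consumer's views,
# which shows that both sides refer to the same bytes.
column[3] = 99
```

No encode or decode step occurs: because both sides understand the layout, handing over the data costs the same regardless of how many values the column holds.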
Columnar data
    persons = [
      {
        name: 'wes',
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'},
        ]
      },
      {
        name: 'mark',
        addresses: [
          {number: 4, street: 'ccc'},
          {number: 5, street: 'dddd'},
          {number: 6, street: 'f'},
        ]
      },
    ]
Columnar data
person.addresses
    offset: 0, 2, 5
person.addresses.street
    offset: 0, 1, 3, 6, 10
    data:   abbcccddddf
person.addresses.number
    values: 2, 3, 4, 5, 6
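The offset and value columns can be derived from the nested `persons` records with a short loop. The sketch below is a plain-Python illustration of the layout, not Arrow code: `flatten_addresses` and `streets_for` are hypothetical helpers, and appending a final offset (11) to mark the end of the last street string is this sketch's own convention.

```python
persons = [
    {"name": "wes",
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "mark",
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 6, "street": "f"}]},
]


def flatten_addresses(records):
    """Hypothetical helper: turn nested addresses into offset/value columns."""
    list_offsets = [0]     # where each person's address list starts and ends
    numbers = []           # person.addresses.number values
    street_offsets = [0]   # where each street string starts and ends
    street_data = ""       # all street characters, concatenated
    for person in records:
        for address in person["addresses"]:
            numbers.append(address["number"])
            street_data += address["street"]
            street_offsets.append(len(street_data))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, street_data


list_offsets, numbers, street_offsets, street_data = flatten_addresses(persons)


def streets_for(person_index):
    """Recover one person's streets from the flat columns alone."""
    start, stop = list_offsets[person_index], list_offsets[person_index + 1]
    return [street_data[street_offsets[i]:street_offsets[i + 1]]
            for i in range(start, stop)]
```

For example, `streets_for(1)` slices the flat columns using offsets 2 through 5 and recovers mark's three streets without touching the nested records.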
Apache Arrow in practice
Thank you
Wes McKinney @wesmckinn
Views are my own