Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney

© Cloudera, Inc. All rights reserved.

Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney, SF Data Mining Meetup 2015-10-22 @wesmckinn


Me

• R&D at Cloudera
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT ‘07
• “Professional SQL programmer” 2007-2010 (@ AQR)
• Created pandas (Python library) in 2008
• Wrote bestseller Python for Data Analysis 2012
• Founder of DataPad


Python is popular…

• Python has become a standard language of data science
• Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy to learn, and makes for happy and productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.

• Main use case: data science & engineering Swiss Army knife on small-to-medium size data


…but Python does not scale today

• Python ecosystem confined to single-node analysis
• Great for smaller data sets
• Requires sampling or aggregations for larger data
• Distributed tools compromise in various ways

• Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analysis and use cases


Some simplistic generalizations

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines

Python: heavy investment, generally

Python: light investment, generally


pandas

• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures

• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)

• Supports flat data only (or semi-structured data that can be flattened)


Many SQL engines

… and more


The “Great Decoupling” for Big Data

UI: Ibis, SQL, Spark API, …

Compute: Analytic SQL, Spark, MapReduce

Storage: HDFS, Kudu, HBase


A sample big data architecture

[Diagram. Labels: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]


Nested / Complex types support

• Arrays, structs, maps, and unions as first-class value types
• Analyze JSON-like data directly without flattening or normalization
• Most new SQL engines have some level of support
  • Impala
  • Presto
  • Drill
  • BigQuery
  • Spark SQL
  • Hive
  • …
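
To make "without flattening" concrete, here is a minimal Python sketch using only the standard library. The records, field names, and the `total_by_user` helper are hypothetical. The point is only that the array of structs in `items` is aggregated in place, rather than first being exploded into a flat table.

```python
import json
from collections import defaultdict

# Hypothetical JSON-like records: each order holds an array of structs.
raw = """
[
  {"user": "a", "items": [{"sku": "x", "price": 2.0}, {"sku": "y", "price": 3.5}]},
  {"user": "b", "items": [{"sku": "x", "price": 2.0}]},
  {"user": "a", "items": []}
]
"""
orders = json.loads(raw)


def total_by_user(records):
    """Sum the nested item prices per user without flattening first."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["user"]] += sum(item["price"] for item in rec["items"])
    return dict(totals)


totals = total_by_user(orders)
```

An engine with first-class nested types can run this kind of aggregation where the data lives, instead of requiring a normalization step beforehand.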


Ibis in a nutshell

• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Joint project with Impala team @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance Python extension APIs


Ibis in a nutshell, cont’d

• Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
• Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
• Development roadmap targets Impala (C++ / LLVM) query engine
• … but SQL compiler toolchain is general purpose

• Currently supports Impala and SQLite, with other dialects to follow soon
• We welcome external contributors for other analytic SQL engines
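
The idea of a composable expression DSL that compiles to SQL can be sketched in a few lines. This is a toy and not the Ibis API: the `Table`, `Column`, `Predicate`, and `Aggregate` classes and the `sales` table are all hypothetical, invented for illustration. SQLite (via Python's standard `sqlite3` module) is used only as a convenient engine to run the generated statement.

```python
import sqlite3


class Column:
    """A named column reference in a toy expression tree (not Ibis)."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        return Predicate(self, ">", other)

    def sum(self):
        return Aggregate("SUM", self)


class Predicate:
    def __init__(self, column, op, value):
        self.column, self.op, self.value = column, op, value


class Aggregate:
    def __init__(self, func, column):
        self.func, self.column = func, column


class Table:
    """Immutable table expression; each method returns a new expression."""

    def __init__(self, name, predicates=()):
        self.name, self.predicates = name, tuple(predicates)

    def __getitem__(self, column_name):
        return Column(column_name)

    def filter(self, predicate):
        return Table(self.name, self.predicates + (predicate,))

    def aggregate(self, agg):
        """Compile to (sql, params); literal values become bound parameters."""
        sql = f"SELECT {agg.func}({agg.column.name}) FROM {self.name}"
        if self.predicates:
            clauses = [f"{p.column.name} {p.op} ?" for p in self.predicates]
            sql += " WHERE " + " AND ".join(clauses)
        return sql, [p.value for p in self.predicates]


# Compose filters step by step, then compile once.
t = Table("sales")
expr = t.filter(t["amount"] > 10).filter(t["qty"] > 1)
sql, params = expr.aggregate(t["amount"].sum())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(5, 3), (20, 2), (30, 1), (40, 5)])
result = conn.execute(sql, params).fetchone()[0]
```

Because each `filter` call returns a new expression, partial queries can be stored in variables and reused, which is hard to do with hand-written SQL strings.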



Benefits of Ibis

• Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools

• Python as first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases


Brief interactive demo


Ibis/Impala Joint Roadmap

• More natural data modeling
  • Complex types support

• Integration with full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of performance computing tools

• User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation

• Workflow and usability tools


Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …

Compute: Analytic SQL, Spark, MapReduce

Storage: HDFS, Kudu, HBase

Python, R, Julia, …?


Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala

• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)

• External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)


What are UDFs good for?

• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines

• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)


Why are external UDFs slow?

• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
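
A rough sketch of the first and third costs, under loud assumptions: `pickle` stands in for whatever wire format an external UDF protocol uses, no real process boundary or network is involved, and `udf` is a hypothetical user function. The sketch only counts simulated round trips; it does not measure time.

```python
import pickle


def udf(x):
    """Hypothetical user function: the work we actually care about."""
    return x * 2 + 1


def call_external_scalar(values):
    """One simulated round trip (serialize, run, deserialize) per value."""
    round_trips = 0
    out = []
    for v in values:
        request = pickle.dumps(v)            # engine -> UDF process
        result = udf(pickle.loads(request))
        response = pickle.dumps(result)      # UDF process -> engine
        out.append(pickle.loads(response))
        round_trips += 1
    return out, round_trips


def call_external_batched(values):
    """One simulated round trip for the whole column of values."""
    request = pickle.dumps(values)
    results = [udf(v) for v in pickle.loads(request)]
    response = pickle.dumps(results)
    return pickle.loads(response), 1


column = list(range(1000))
scalar_out, scalar_trips = call_external_scalar(column)
batched_out, batched_trips = call_external_batched(column)
```

Both paths return the same values, but the scalar path pays the fixed per-call cost once per row, while the batched path pays it once per column.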


Example: Vectorization for interpreted languages

SUM(CASE WHEN x > y THEN x ELSE x + y END)
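
As a minimal sketch of the difference, the expression above can be evaluated two ways in plain Python. The tuple-based expression tree and both evaluators are hypothetical, written for illustration. The `dispatches` counter tracks how many times the interpreter visits a tree node, which is the overhead that vectorization amortizes.

```python
import operator

# Expression tree for: CASE WHEN x > y THEN x ELSE x + y END
EXPR = ("case", ("gt", ("col", "x"), ("col", "y")),
        ("col", "x"),
        ("add", ("col", "x"), ("col", "y")))

BINARY = {"gt": operator.gt, "add": operator.add}


def eval_row(node, row, stats):
    """Interpret the tree for a single row (one dispatch per node per row)."""
    stats["dispatches"] += 1
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "case":
        cond = eval_row(node[1], row, stats)
        return eval_row(node[2] if cond else node[3], row, stats)
    left = eval_row(node[1], row, stats)
    right = eval_row(node[2], row, stats)
    return BINARY[kind](left, right)


def eval_columns(node, columns, stats):
    """Interpret the tree once; every node processes whole columns."""
    stats["dispatches"] += 1
    kind = node[0]
    if kind == "col":
        return columns[node[1]]
    if kind == "case":
        cond = eval_columns(node[1], columns, stats)
        then = eval_columns(node[2], columns, stats)
        other = eval_columns(node[3], columns, stats)
        return [t if c else o for c, t, o in zip(cond, then, other)]
    left = eval_columns(node[1], columns, stats)
    right = eval_columns(node[2], columns, stats)
    return list(map(BINARY[kind], left, right))


columns = {"x": [1, 5, 3, 8], "y": [4, 2, 3, 1]}
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]

row_stats = {"dispatches": 0}
row_total = sum(eval_row(EXPR, r, row_stats) for r in rows)

col_stats = {"dispatches": 0}
col_total = sum(eval_columns(EXPR, columns, col_stats))
```

The column-at-a-time evaluator visits each of the 8 tree nodes once regardless of row count, whereas the row-at-a-time evaluator's dispatch count grows with the data. The trade-off is that the vectorized version computes both CASE branches for every row. In practice the per-column loops would run in native code (for example via NumPy) rather than in Python list comprehensions.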


Vectorized vs Interpreted perf


How to make them fast?

• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
• Impala is uniquely positioned to play well with Ibis
  • Best-in-class performance and scalability
  • C++ and LLVM-based (JIT compiler) runtime
• Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python
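
The zero-copy idea can be illustrated within a single Python process using the standard `array` and `memoryview` types. This is only a stand-in: a real external UDF protocol would share memory across processes, and `scale_in_place` is a hypothetical UDF. What the sketch shows is that a view over a contiguous column lets a function operate on the original bytes, while a copy is detached from them.

```python
from array import array

# A column of 64-bit floats stored in one contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same bytes without copying them.
view = memoryview(column)


def scale_in_place(buf, factor):
    """Hypothetical vectorized UDF: mutates the shared buffer directly."""
    for i in range(len(buf)):
        buf[i] *= factor


scale_in_place(view, 10.0)

# A copy, by contrast, is detached from the original buffer.
copied = array("d", column)
copied[0] = -1.0
```

After the call, `column` itself holds the scaled values because `view` and `column` share storage; changing `copied` leaves `column` untouched.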


Memory representation

• Many query engines are standardizing on in-memory columnar representation of materialized transient data
  • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
  • Apache Drill: https://drill.apache.org/faq/

• Industry-standard serialization format: Apache Parquet
  • https://parquet.apache.org/


Serialization vs In-memory

• Serialization formats (e.g. Parquet)
  • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
  • Do not consider random access or in-memory analytics as a goal

• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
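
A small sketch of the trade-off, with stand-ins rather than real formats: a `zlib`-compressed `struct` blob plays the role of a serialization format optimized for IO, and a fixed-width `array` plays the role of an in-memory column. Neither is Parquet or any engine's actual layout, and the helper names are hypothetical.

```python
import struct
import zlib
from array import array

values = list(range(0, 100_000, 7))

# Stand-in for a serialization format: packed and compressed for IO.
serialized = zlib.compress(struct.pack(f"<{len(values)}q", *values))

# Stand-in for an in-memory column: fixed-width, directly indexable.
in_memory = array("q", values)


def read_serialized(blob, index):
    """Random access requires decoding the whole block first."""
    raw = zlib.decompress(blob)
    return struct.unpack_from("<q", raw, index * 8)[0]


def read_in_memory(column, index):
    """Random access is a single offset computation."""
    return column[index]


serialized_size = len(serialized)
in_memory_size = in_memory.itemsize * len(in_memory)
```

The serialized blob is smaller, which helps IO throughput, but reading one element costs a full decompression. The in-memory column is larger but supports direct indexing, which is what in-memory analytics needs.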


Standardized in-memory columnar (IMC)

• Compact in-memory representation for semistructured data
• Part of Impala’s upcoming dev roadmap
• Some prior IMC-for-SQL work: Apache Drill
• Standardized memory representation means data can be shared without serialization
• Create a canonical C/C++ implementation for use in Python / R / Julia


Ibis’s Vision

• Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)

• Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases


Thank you
Wes McKinney @wesmckinn
Views are my own