
Enabling Python to be a Better Big Data Citizen


Page 1: Enabling Python to be a Better Big Data Citizen

© Cloudera, Inc. All rights reserved.

Enabling Python to be a Better Big Data Citizen
Wes McKinney @wesmckinn
NYC Python Meetup, 2016-02-17

Page 2: Enabling Python to be a Better Big Data Citizen

Me

• R&D at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++

Page 3: Enabling Python to be a Better Big Data Citizen

Some simplistic generalizations

Industry Analytics
• Heterogeneous data
• Flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
• Python: light investment, generally

Scientific Computing
• Homogeneous data
• Multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines
• Python: heavy investment, generally

Page 4: Enabling Python to be a Better Big Data Citizen

A sample big data architecture

[Diagram. Components shown: Application data, Kafka (x4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, SQL, User]

Page 5: Enabling Python to be a Better Big Data Citizen

pandas

• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)

Page 6: Enabling Python to be a Better Big Data Citizen

2016 Python Data Trends

• Improved Python interoperability with the Apache Hadoop ecosystem
  • I’m working with {Arrow, Kudu, Impala, Parquet, Spark}
• Support for big data file formats like Apache Parquet
• Native in-memory Python support for nested / JSON-like data

Page 7: Enabling Python to be a Better Big Data Citizen

Ibis in a nutshell

• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Cross-team project @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance extensions in Python

Page 8: Enabling Python to be a Better Big Data Citizen


Page 9: Enabling Python to be a Better Big Data Citizen

Enabling interoperability with big data systems

• Distributed / MPP query engines: implemented in a host language
  • Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
  • Implement in host language
  • Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)

Page 10: Enabling Python to be a Better Big Data Citizen

Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase

Python, R, Julia, …?

Page 11: Enabling Python to be a Better Big Data Citizen

Python interoperability challenges

• Problem 1: Serialization / deserialization overhead

[Diagram: a big data system passes input partitions 0 through n - 1 as input to user-supplied Python functions; each function's output becomes output partitions 0 through n - 1 in the big data system.]

Page 12: Enabling Python to be a Better Big Data Citizen

Data movement can be extremely costly

[Diagram: input partition 0 passed as input to a Python function]

Questions
• How to represent “data in-flight” (RPC)?
• Cost of conversion between in-memory data structures and RPC representation
• How to communicate schemas / metadata?

Page 13: Enabling Python to be a Better Big Data Citizen

Data movement can be extremely costly

[Diagram: input partition 0 passed as input to a Python function]

Slow data movement / conversion can largely undermine the performance benefits of Python’s high performance in-memory data tools
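To make the cost of data movement concrete, here is a minimal sketch that runs the same function twice: once directly, and once with a JSON round trip standing in for the RPC hop between a host system and a Python worker. The names `apply_udf_via_json`, `apply_udf_in_process`, and `double` are hypothetical, and JSON is only an assumed stand-in for whatever wire format a real system uses; the absolute timings will vary by machine.

```python
import json
import time


def apply_udf_via_json(values, func):
    """Mimic an external UDF call: serialize, deserialize, compute, send back."""
    payload = json.dumps(values)            # host system -> wire format
    received = json.loads(payload)          # wire format -> Python objects
    result = [func(v) for v in received]    # the user's actual work
    return json.loads(json.dumps(result))   # results travel back the same way


def apply_udf_in_process(values, func):
    """Baseline: the same computation with no data movement."""
    return [func(v) for v in values]


def timed(fn, *args):
    """Return fn's result together with its wall-clock duration in seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


def double(x):
    return x * 2.0


values = [float(i) for i in range(200_000)]
via_json, t_json = timed(apply_udf_via_json, values, double)
in_proc, t_proc = timed(apply_udf_in_process, values, double)

print(f"with JSON round trip: {t_json:.4f}s, in process: {t_proc:.4f}s")
```

Both paths produce the same answer; any difference in elapsed time comes purely from converting the data to and from the wire representation.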

Page 14: Enabling Python to be a Better Big Data Citizen

Python interoperability challenges

• Problem 2: Scalar vs vectorized computations

SCALAR

    result = np.empty(n)
    for i in range(n):
        result[i] = f(a[i], b[i])

VECTORIZED (often 100-1000x faster)

    result = f(a, b)
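The two forms above can be tried side by side. This is a small sketch, assuming NumPy is installed; the function `f`, the array size, and the random inputs are illustrative choices, not taken from the talk, and the measured speedup depends on the machine and on what `f` does.

```python
import time

import numpy as np


def f(x, y):
    # Works on Python/NumPy scalars and on whole NumPy arrays alike.
    return x * y + 1.0


n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)

# Scalar: one Python-level function call per element.
start = time.perf_counter()
scalar_result = np.empty(n)
for i in range(n):
    scalar_result[i] = f(a[i], b[i])
scalar_seconds = time.perf_counter() - start

# Vectorized: a single call; the loop runs inside NumPy's compiled code.
start = time.perf_counter()
vector_result = f(a, b)
vector_seconds = time.perf_counter() - start

print(f"scalar: {scalar_seconds:.4f}s  vectorized: {vector_seconds:.6f}s")
```

A system that hands Python one value at a time forces the scalar path, whatever the Python library underneath is capable of.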

Page 15: Enabling Python to be a Better Big Data Citizen

Apache Arrow: What is it?

• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
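The idea of interchange with "little or no serialization" can be illustrated by analogy with Python's own buffer protocol. The sketch below uses only the standard library and is not Arrow's API or memory format; it simply shows a producer and a consumer agreeing on one typed, contiguous layout so that the consumer reads the producer's memory directly.

```python
from array import array

# Producer: a column of 64-bit floats stored in one contiguous buffer.
column = array("d", [1.5, 2.5, 3.5, 4.5])

# Consumer: a typed view over the same memory; nothing is copied or re-encoded.
view = memoryview(column)

# Slicing a memoryview yields another view, not a copy.
tail = view[2:]

# A write by the producer is visible through both views, confirming shared memory.
column[3] = 10.0
total = sum(view)

print(view.format, view.nbytes, total, tail[1])
```

Because both sides already agree on the layout, handing over the data costs about the same regardless of how many values the column holds.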

Page 16: Enabling Python to be a Better Big Data Citizen

Columnar data

    persons = [
        {
            'name': 'wes',
            'addresses': [
                {'number': 2, 'street': 'a'},
                {'number': 3, 'street': 'bb'},
            ]
        },
        {
            'name': 'mark',
            'addresses': [
                {'number': 4, 'street': 'ccc'},
                {'number': 5, 'street': 'dddd'},
                {'number': 6, 'street': 'f'},
            ]
        },
    ]

Page 17: Enabling Python to be a Better Big Data Citizen

Columnar data

person.addresses
    offset: 0, 2, 5

person.addresses.street
    offset: 0, 1, 3, 6, 10
    data: abbcccddddf

person.addresses.number
    data: 2, 3, 4, 5, 6
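The layout on this slide can be reproduced with a short sketch that flattens the nested `persons` records into offset arrays and contiguous value buffers. The helper names `to_columns` and `streets_of` are hypothetical. One assumed difference from the slide: the sketch stores a trailing end offset for the street column (11), so that every string, including the last, is delimited by a pair of offsets.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]


def to_columns(records):
    """Flatten records into offset arrays plus contiguous value buffers."""
    address_offsets = [0]   # where each person's address list starts and ends
    numbers = []            # person.addresses.number values
    street_offsets = [0]    # where each street string starts and ends
    street_parts = []       # person.addresses.street characters
    for person in records:
        for address in person["addresses"]:
            numbers.append(address["number"])
            street_parts.append(address["street"])
            street_offsets.append(street_offsets[-1] + len(address["street"]))
        address_offsets.append(len(numbers))
    return address_offsets, numbers, street_offsets, "".join(street_parts)


def streets_of(i, address_offsets, street_offsets, street_data):
    """Read back person i's streets using only the offsets."""
    lo, hi = address_offsets[i], address_offsets[i + 1]
    return [street_data[street_offsets[j]:street_offsets[j + 1]]
            for j in range(lo, hi)]


address_offsets, numbers, street_offsets, street_data = to_columns(persons)
print(address_offsets, numbers, street_offsets, street_data)
print(streets_of(1, address_offsets, street_offsets, street_data))
```

Reading one person's addresses touches only a slice of each buffer, and a whole column such as `numbers` can be scanned without visiting any names or streets.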

Page 18: Enabling Python to be a Better Big Data Citizen

Apache Arrow in practice

Page 19: Enabling Python to be a Better Big Data Citizen

Thank you
Wes McKinney @wesmckinn
Views are my own