Python, Pandas, Spark 2.0
Sky
• Python 2000
• db tech showcase MongoDB
• FB: Ryuji Tamagawa
• Twitter: tamagawa_ryuji
2017
• Python Spark
• Python / Pandas
• Spark 2.0
Part 1 :
csv
Python
Pandas Python
Jupyter Notebook
Jenkins
Spark 2.0
• Spark API: RDD (~1.3), DataFrame / Dataset (1.4~)
• DataFrame API
RDD API Python Spark
DataFrame
• RDB /
• R Pandas Spark
Spark
R / Pandas
Spark +
Part 2 :
CSVzip
RDB
Parquet
Excel
CSV
Feather
Spark
Pandas / Spark
• CPU
• Pandas read_csv zip CSV
Pandas
2
• CSV CPU
Pandas zip CSV
CPU …
• Parquet !
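A minimal sketch of reading a zip-compressed CSV directly with Pandas read_csv, as mentioned above. The file name, column names, and helper functions are hypothetical and chosen only for illustration; the archive is built with the standard library so the example is self-contained.

```python
import os
import tempfile
import zipfile

import pandas as pd


def write_zipped_csv(path):
    """Create a small zip archive holding a single CSV file (sample data)."""
    csv_text = "key,value\nabc,1\nxyz,2\nabc,3\n"
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("sample.csv", csv_text)


def read_zipped_csv(path):
    """Read the CSV straight from the archive, without unpacking it first."""
    # compression="zip" expects the archive to contain exactly one file.
    return pd.read_csv(path, compression="zip")


tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.zip")
write_zipped_csv(zip_path)
df = read_zipped_csv(zip_path)
print(df)
```

Decompression and CSV parsing both happen inside the read_csv call, so the time spent there is CPU work on top of the disk read.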
: Parquet
I/O
• Spark Parquet
• Python Parquet
HDFS / S3
Parquet Parquet
SSD
Parquet Parquet
Parquet
No
No
Yes
HDD
• I/O Pandas
• Spark
• DataFrame Pandas → Spark
Spark → Pandas Pandas → Spark
• Apache Arrow
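A rough sketch of the two conversion directions listed above (Pandas → Spark and Spark → Pandas), assuming a PySpark installation and an existing SparkSession passed in as `spark`. The sample data and the helper names `make_sample` and `round_trip` are hypothetical.

```python
import pandas as pd


def make_sample():
    """Build a small Pandas DataFrame to move between the two libraries."""
    return pd.DataFrame({"key": ["abc", "xyz"], "value": [1, 2]})


def round_trip(spark, pdf):
    """Convert Pandas -> Spark -> Pandas using an existing SparkSession."""
    # Pandas -> Spark: the local data is shipped from the Python driver to the JVM.
    sdf = spark.createDataFrame(pdf)
    # Spark -> Pandas: all rows are collected back into the driver process.
    return sdf.toPandas()
```

Both directions copy the whole dataset between Python and the JVM, so this is only practical for data that fits in the driver's memory.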
CPU
~2010
2010~ SSD
CPU
Apache Spark 2.0
• 1.x
• 2.0
1.x
• DataFrame API Python
• databricks
http://go.databricks.com/mastering-apache-spark-2.0
Spark 2.0
• CPU
• CPU
• SQL DataFrame
• + SSD
• CSV zip
Pandas read_csv
Python + Spark
• Python serialize
• DataFrame API UDF
• UDF Scala/Java
• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api
Executor
JVM
DataFrame, Cached
Python
lambda items: items[0] == 'abc'
transfer
DataFrame, result
transfer
Driver
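The diagram above contrasts a Python lambda filter with the DataFrame API. A minimal sketch of the same filter written both ways follows, assuming PySpark is installed and a SparkSession is passed in as `spark`. The column names, the sample rows, and the function names `is_abc` and `compare_filters` are hypothetical.

```python
def is_abc(items):
    """The predicate from the diagram: keep rows whose first field is 'abc'."""
    return items[0] == "abc"


def compare_filters(spark):
    """Run the same filter through the RDD API and the DataFrame API."""
    # Imported here so the predicate above can be tried without Spark installed.
    from pyspark.sql import functions as F

    rows = [("abc", 1), ("xyz", 2), ("abc", 3)]
    df = spark.createDataFrame(rows, ["key", "value"])

    # RDD API: rows are serialized, transferred to a Python worker process,
    # and the Python function is evaluated there row by row.
    rdd_result = df.rdd.filter(is_abc).collect()

    # DataFrame API: the condition is a column expression, so the filter
    # itself is evaluated inside the JVM executor.
    df_result = df.filter(F.col("key") == "abc").collect()

    return rdd_result, df_result
```

With a local session, `compare_filters(spark)` should return the same two rows from both paths; the difference is where the predicate runs, not what it selects.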