Upload
alluxio-inc
View
996
Download
4
Embed Size (px)
Citation preview
Best Practices for Using Alluxio with Spark
Gene Pang, Alluxio, Inc.Cheng Chang, Alluxio, Inc.Spark Summit SF - June 2017
OutlineAlluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 2
Data Ecosystem Yesterday
• One Compute Framework
• Single Storage System• Co-located
©2017 Alluxio, Inc. All Rights Reserved 3
Data Ecosystem Today
• Many Compute Frameworks
• Multiple Storage Systems• Most not co-located
©2017 Alluxio, Inc. All Rights Reserved 4
Data Ecosystem Issues
• Each application manage multiple data sources
• Add/Removing data sources require application changes
• Storage optimizations requires application change
• Lower performance due to lack of locality
©2017 Alluxio, Inc. All Rights Reserved 5
Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
• No Lock in
Native File System Hadoop Compatible File System
Native Key-Value Interface
Fuse Compatible File System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
©2017 Alluxio, Inc. All Rights Reserved 6
Next Gen Analytics with Alluxio
Native File System Hadoop Compatible File System
Native Key-Value Interface
Fuse Compatible File System
HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface
Apps, Data & Storageat Mem Speed
ü Big Data/IoTü AI/MLü Deep Learningü Cloud Migrationü Multi Platformü Autonomous
©2017 Alluxio, Inc. All Rights Reserved 7
Fastest Growing Big Data Open Source Projects
Fastest Growing open-source project in the big data ecosystem
Running in large production clusters
500+ Contributors from 100+ organizations
0
100
200
300
400
5000 10 20 30 40 45
Num
ber
of C
ontr
ibut
ors
Github Open Source Contributors by Month
Alluxio
Spark
Kafka
Redis
HDFS
Cassandra
Hive
©2017 Alluxio, Inc. All Rights Reserved 8
OutlineAlluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 9
Big Data Case Study –
1 06/12/17 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –Gain end to end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
SPARK
TERADATA
SPARK
TERADATA
Solution –ETL Data from Teradata to Alluxio
Impact –Faster Time to Market – “Now we don’t have to work Sundays”
http://bit.ly/2oMx95W
Big Data Case Study –
1 16/12/17 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –Gain end to end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
SPARK
Baidu File System
SPARK
Baidu File System
Solution –With Alluxio, data queries are 30X faster
Impact –Higher operational efficiency
http://bit.ly/2pDHS3O
Big Data Case Study –
Challenge –Gain end to end view of business with large volume of data for $5B Travel Site
Queries were slow / not interactive, resulting in operational inefficiency
SPARK
HDFS
Solution –With Alluxio, 300x improvement in performance
Impact –Increased revenue from immediate response to user behaviorUse case: http://bit.ly/2pDJdrq
CEPH
HDFS CEPH
FLINK SPARK FLINK
©2017 Alluxio, Inc. All Rights Reserved 1 2
Machine Learning Case Study –
1 36/12/17 ©2017 Alluxio, Inc. All Rights Reserved
Challenge –Disparate Data both on-premand Cloud. Heterogeneous types of data.
Scaling of Exabyte size data. Slow due to disk based approach.
SPARK
HDFS
SPARK
MINIO
Solution –Using Alluxio to prevent I/O bottlenecks
Impact –Orders of magnitude higher performance than before.http://bit.ly/2p18ds3
MES
OS
OutlineAlluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 1 4
Consolidating Memory
Storage Engine & Execution EngineSame Process
• Two copies of data in memory – double the memory used• Inter-process Sharing Slowed Down by Network / Disk I/O
Spark Compute
Spark Storage
block 1
block 3
HDFS / Amazon S3block 1
block 3
block 2
block 4
Spark Compute
Spark Storage
block 1
block 3
©2017 Alluxio, Inc. All Rights Reserved 1 5
Consolidating Memory
Storage Engine & Execution EngineDifferent process
• Half the memory used• Inter-process Sharing Happens at Memory Speed
Spark Compute
Spark Storage
HDFS / Amazon S3block 1
block 3
block 2
block 4
HDFSdisk
block 1
block 3
block 2
block 4Alluxio
block 1
block 3 block 4
Spark Compute
Spark Storage
©2017 Alluxio, Inc. All Rights Reserved 1 6
Data Resilience During Crash
Spark Compute
Spark Storageblock 1
block 3
HDFS / Amazon S3block 1
block 3
block 2
block 4
Storage Engine & Execution EngineSame Process
©2017 Alluxio, Inc. All Rights Reserved 1 7
Data Resilience During Crash
CRASH
Spark Storageblock 1
block 3
HDFS / Amazon S3block 1
block 3
block 2
block 4
• Process Crash Requires Network and/or Disk I/O to Re-read Data
Storage Engine & Execution EngineSame Process
©2017 Alluxio, Inc. All Rights Reserved 1 8
Data Resilience During Crash
CRASH
HDFS / Amazon S3block 1
block 3
block 2
block 4
Storage Engine & Execution EngineSame Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
©2017 Alluxio, Inc. All Rights Reserved 1 9
Data Resilience During Crash
Spark Compute
Spark Storage
HDFS / Amazon S3block 1
block 3
block 2
block 4
HDFSdisk
block 1
block 3
block 2
block 4Alluxio
block 1
block 3 block 4
Storage Engine & Execution EngineDifferent process
©2017 Alluxio, Inc. All Rights Reserved 2 0
Data Resilience During Crash
• Process Crash – Data is Re-read at Memory Speed
HDFS / Amazon S3block 1
block 3
block 2
block 4
HDFSdisk
block 1
block 3
block 2
block 4Alluxio
block 1
block 3 block 4
CRASH Storage Engine & Execution EngineDifferent process
©2017 Alluxio, Inc. All Rights Reserved 2 1
Accessing Alluxio Data From Spark
Writing Data Write to an Alluxio file
Reading Data Read from an Alluxio file
©2017 Alluxio, Inc. All Rights Reserved 2 2
Code Example for Spark RDDs
Writing RDD to Alluxiordd.saveAsTextFile(alluxioPath)rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxiordd = sc.textFile(alluxioPath)rdd = sc.objectFile(alluxioPath)
©2017 Alluxio, Inc. All Rights Reserved 2 3
Code Example for Spark DataFrames
Writing to Alluxio df.write.parquet(alluxioPath)
Reading from Alluxio df = sc.read.parquet(alluxioPath)
©2017 Alluxio, Inc. All Rights Reserved 2 4
OutlineAlluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 2 5
Experiments
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
Alluxio
Spark Storage Level: MEMORY_ONLY
Spark Storage Level: MEMORY_ONLY_SER
Spark Storage Level: DISK_ONLY
©2017 Alluxio, Inc. All Rights Reserved 2 6
0
50
100
150
200
250
0 5 10 15 20 25 30 35 40 45 50
Tim
e [s
econ
ds]
RDD Size [GB]
Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY
Reading Cached RDD
©2017 Alluxio, Inc. All Rights Reserved 2 7
0 100 200 300 400 500 600 700 800
Alluxio(textFile)
Alluxio(objectFile)
No Alluxio
Time [seconds]
7x speedup
16x speedup
New Context: Read 50 GB RDD (S3)
©2017 Alluxio, Inc. All Rights Reserved 2 8
Reading Cached DataFrame (parquet)
0
50
100
150
200
250
0 5 10 15 20 25 30 35 40 45 50
Tim
e [s
econ
ds]
DataFrame Size [GB]
Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY
©2017 Alluxio, Inc. All Rights Reserved 2 9
New Context: Read 50 GB DataFrame(S3)
0 250 500 750 1000 1250 1500 1750
Alluxio
No Alluxio
Time [seconds]
10x average speedup, 17x peak speedup
©2017 Alluxio, Inc. All Rights Reserved 3 0
OutlineAlluxio Overview
Alluxio + Spark Use Cases
Using Spark with Alluxio
Performance Evaluation
Demo
1
2
3
4
5
©2017 Alluxio, Inc. All Rights Reserved 3 1
Demo Environment
Spark
Alluxio
©2017 Alluxio, Inc. All Rights Reserved 3 2
Conclusion
Easy to use Alluxio with Spark
Predictable and improved performance
Easily connect to various storages
©2017 Alluxio, Inc. All Rights Reserved 3 3
Thank you!Gene Pang Cheng [email protected] [email protected]: @unityxx Twitter: @uronce
Twitter.com/alluxio
Linkedin.com/alluxio
Websitewww.alluxio.com
@
Social Media
©2017 Alluxio, Inc. All Rights Reserved 3 4