34
Best Practices for Using Alluxio with Spark Gene Pang, Alluxio, Inc. Cheng Chang, Alluxio, Inc. Spark Summit SF - June 2017

Best Practices for Using Alluxio with Spark

Embed Size (px)

Citation preview

Page 1: Best Practices for Using Alluxio with Spark

Best Practices for Using Alluxio with Spark

Gene Pang, Alluxio, Inc.Cheng Chang, Alluxio, Inc.Spark Summit SF - June 2017

Page 2: Best Practices for Using Alluxio with Spark

OutlineAlluxio Overview

Alluxio + Spark Use Cases

Using Spark with Alluxio

Performance Evaluation

Demo

1

2

3

4

5

©2017 Alluxio, Inc. All Rights Reserved 2

Page 3: Best Practices for Using Alluxio with Spark

Data Ecosystem Yesterday

• One Compute Framework

• Single Storage System• Co-located

©2017 Alluxio, Inc. All Rights Reserved 3

Page 4: Best Practices for Using Alluxio with Spark

Data Ecosystem Today

• Many Compute Frameworks

• Multiple Storage Systems• Most not co-located

©2017 Alluxio, Inc. All Rights Reserved 4

Page 5: Best Practices for Using Alluxio with Spark

Data Ecosystem Issues

• Each application manage multiple data sources

• Add/Removing data sources require application changes

• Storage optimizations requires application change

• Lower performance due to lack of locality

©2017 Alluxio, Inc. All Rights Reserved 5

Page 6: Best Practices for Using Alluxio with Spark

Data Ecosystem with Alluxio

• Apps only talk to Alluxio

• Simple Add/Remove

• No App Changes

• Highest performance in Memory

• No Lock in

Native File System Hadoop Compatible File System

Native Key-Value Interface

Fuse Compatible File System

HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface

©2017 Alluxio, Inc. All Rights Reserved 6

Page 7: Best Practices for Using Alluxio with Spark

Next Gen Analytics with Alluxio

Native File System Hadoop Compatible File System

Native Key-Value Interface

Fuse Compatible File System

HDFS Interface Amazon S3 Interface Swift Interface GlusterFS Interface

Apps, Data & Storageat Mem Speed

ü Big Data/IoTü AI/MLü Deep Learningü Cloud Migrationü Multi Platformü Autonomous

©2017 Alluxio, Inc. All Rights Reserved 7

Page 8: Best Practices for Using Alluxio with Spark

Fastest Growing Big Data Open Source Projects

Fastest Growing open-source project in the big data ecosystem

Running in large production clusters

500+ Contributors from 100+ organizations

0

100

200

300

400

5000 10 20 30 40 45

Num

ber

of C

ontr

ibut

ors

Github Open Source Contributors by Month

Alluxio

Spark

Kafka

Redis

HDFS

Cassandra

Hive

©2017 Alluxio, Inc. All Rights Reserved 8

Page 9: Best Practices for Using Alluxio with Spark

OutlineAlluxio Overview

Alluxio + Spark Use Cases

Using Spark with Alluxio

Performance Evaluation

Demo

1

2

3

4

5

©2017 Alluxio, Inc. All Rights Reserved 9

Page 10: Best Practices for Using Alluxio with Spark

Big Data Case Study –

1 06/12/17   ©2017 Alluxio, Inc. All Rights Reserved

Challenge –Gain end to end view of business with large volume of data

Queries were slow / not interactive, resulting in operational inefficiency

SPARK

TERADATA

SPARK

TERADATA

Solution –ETL Data from Teradata to Alluxio

Impact –Faster Time to Market – “Now we don’t have to work Sundays”

http://bit.ly/2oMx95W

Page 11: Best Practices for Using Alluxio with Spark

Big Data Case Study –

1 16/12/17   ©2017 Alluxio, Inc. All Rights Reserved

Challenge –Gain end to end view of business with large volume of data

Queries were slow / not interactive, resulting in operational inefficiency

SPARK

Baidu File System

SPARK

Baidu File System

Solution –With Alluxio, data queries are 30X faster

Impact –Higher operational efficiency

http://bit.ly/2pDHS3O

Page 12: Best Practices for Using Alluxio with Spark

Big Data Case Study –

Challenge –Gain end to end view of business with large volume of data for $5B Travel Site

Queries were slow / not interactive, resulting in operational inefficiency

SPARK

HDFS

Solution –With Alluxio, 300x improvement in performance

Impact –Increased revenue from immediate response to user behaviorUse case: http://bit.ly/2pDJdrq

CEPH

HDFS CEPH

FLINK SPARK FLINK

©2017 Alluxio, Inc. All Rights Reserved 1 2

Page 13: Best Practices for Using Alluxio with Spark

Machine Learning Case Study –

1 36/12/17   ©2017 Alluxio, Inc. All Rights Reserved

Challenge –Disparate Data both on-premand Cloud. Heterogeneous types of data.

Scaling of Exabyte size data. Slow due to disk based approach.

SPARK

HDFS

SPARK

MINIO

Solution –Using Alluxio to prevent I/O bottlenecks

Impact –Orders of magnitude higher performance than before.http://bit.ly/2p18ds3

MES

OS

Page 14: Best Practices for Using Alluxio with Spark

OutlineAlluxio Overview

Alluxio + Spark Use Cases

Using Spark with Alluxio

Performance Evaluation

Demo

1

2

3

4

5

©2017 Alluxio, Inc. All Rights Reserved 1 4

Page 15: Best Practices for Using Alluxio with Spark

Consolidating Memory

Storage Engine & Execution EngineSame Process

• Two copies of data in memory – double the memory used• Inter-process Sharing Slowed Down by Network / Disk I/O

Spark Compute

Spark Storage

block 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

Spark Compute

Spark Storage

block 1

block 3

©2017 Alluxio, Inc. All Rights Reserved 1 5

Page 16: Best Practices for Using Alluxio with Spark

Consolidating Memory

Storage Engine & Execution EngineDifferent process

• Half the memory used• Inter-process Sharing Happens at Memory Speed

Spark Compute

Spark Storage

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Spark Compute

Spark Storage

©2017 Alluxio, Inc. All Rights Reserved 1 6

Page 17: Best Practices for Using Alluxio with Spark

Data Resilience During Crash

Spark Compute

Spark Storageblock 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

Storage Engine & Execution EngineSame Process

©2017 Alluxio, Inc. All Rights Reserved 1 7

Page 18: Best Practices for Using Alluxio with Spark

Data Resilience During Crash

CRASH

Spark Storageblock 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

• Process Crash Requires Network and/or Disk I/O to Re-read Data

Storage Engine & Execution EngineSame Process

©2017 Alluxio, Inc. All Rights Reserved 1 8

Page 19: Best Practices for Using Alluxio with Spark

Data Resilience During Crash

CRASH

HDFS / Amazon S3block 1

block 3

block 2

block 4

Storage Engine & Execution EngineSame Process

• Process Crash Requires Network and/or Disk I/O to Re-read Data

©2017 Alluxio, Inc. All Rights Reserved 1 9

Page 20: Best Practices for Using Alluxio with Spark

Data Resilience During Crash

Spark Compute

Spark Storage

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Storage Engine & Execution EngineDifferent process

©2017 Alluxio, Inc. All Rights Reserved 2 0

Page 21: Best Practices for Using Alluxio with Spark

Data Resilience During Crash

• Process Crash – Data is Re-read at Memory Speed

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

CRASH Storage Engine & Execution EngineDifferent process

©2017 Alluxio, Inc. All Rights Reserved 2 1

Page 22: Best Practices for Using Alluxio with Spark

Accessing Alluxio Data From Spark

Writing Data Write to an Alluxio file

Reading Data Read from an Alluxio file

©2017 Alluxio, Inc. All Rights Reserved 2 2

Page 23: Best Practices for Using Alluxio with Spark

Code Example for Spark RDDs

Writing RDD to Alluxiordd.saveAsTextFile(alluxioPath)rdd.saveAsObjectFile(alluxioPath)

Reading RDD from Alluxiordd = sc.textFile(alluxioPath)rdd = sc.objectFile(alluxioPath)

©2017 Alluxio, Inc. All Rights Reserved 2 3

Page 24: Best Practices for Using Alluxio with Spark

Code Example for Spark DataFrames

Writing to Alluxio df.write.parquet(alluxioPath)

Reading from Alluxio df = sc.read.parquet(alluxioPath)

©2017 Alluxio, Inc. All Rights Reserved 2 4

Page 25: Best Practices for Using Alluxio with Spark

OutlineAlluxio Overview

Alluxio + Spark Use Cases

Using Spark with Alluxio

Performance Evaluation

Demo

1

2

3

4

5

©2017 Alluxio, Inc. All Rights Reserved 2 5

Page 26: Best Practices for Using Alluxio with Spark

Experiments

Spark 2.0.0 + Alluxio 1.2.0

Single worker: Amazon r3.2xlarge

Comparisons:

Alluxio

Spark Storage Level: MEMORY_ONLY

Spark Storage Level: MEMORY_ONLY_SER

Spark Storage Level: DISK_ONLY

©2017 Alluxio, Inc. All Rights Reserved 2 6

Page 27: Best Practices for Using Alluxio with Spark

0

50

100

150

200

250

0 5 10 15 20 25 30 35 40 45 50

Tim

e [s

econ

ds]

RDD Size [GB]

Alluxio (textFile) Alluxio (objectFile) DISK_ONLY MEMORY_ONLY_SER MEMORY_ONLY

Reading Cached RDD

©2017 Alluxio, Inc. All Rights Reserved 2 7

Page 28: Best Practices for Using Alluxio with Spark

0 100 200 300 400 500 600 700 800

Alluxio(textFile)

Alluxio(objectFile)

No Alluxio

Time [seconds]

7x  speedup

16x  speedup

New Context: Read 50 GB RDD (S3)

©2017 Alluxio, Inc. All Rights Reserved 2 8

Page 29: Best Practices for Using Alluxio with Spark

Reading Cached DataFrame (parquet)

0

50

100

150

200

250

0 5 10 15 20 25 30 35 40 45 50

Tim

e [s

econ

ds]

DataFrame Size [GB]

Alluxio (textFile) MEMORY_ONLY_SER MEMORY_ONLY

©2017 Alluxio, Inc. All Rights Reserved 2 9

Page 30: Best Practices for Using Alluxio with Spark

New Context: Read 50 GB DataFrame(S3)

0 250 500 750 1000 1250 1500 1750

Alluxio

No Alluxio

Time [seconds]

10x average speedup, 17x peak speedup

©2017 Alluxio, Inc. All Rights Reserved 3 0

Page 31: Best Practices for Using Alluxio with Spark

OutlineAlluxio Overview

Alluxio + Spark Use Cases

Using Spark with Alluxio

Performance Evaluation

Demo

1

2

3

4

5

©2017 Alluxio, Inc. All Rights Reserved 3 1

Page 32: Best Practices for Using Alluxio with Spark

Demo Environment

Spark

Alluxio

©2017 Alluxio, Inc. All Rights Reserved 3 2

Page 33: Best Practices for Using Alluxio with Spark

Conclusion

Easy to use Alluxio with Spark

Predictable and improved performance

Easily connect to various storages

©2017 Alluxio, Inc. All Rights Reserved 3 3

Page 34: Best Practices for Using Alluxio with Spark

Thank you!Gene Pang Cheng [email protected] [email protected]: @unityxx Twitter: @uronce

Twitter.com/alluxio

Linkedin.com/alluxio

Websitewww.alluxio.com

[email protected]

@

Social Media

©2017 Alluxio, Inc. All Rights Reserved 3 4