20
Alluxio (formerly Tachyon): Getting Started with Alluxio + Spark + S3 Calvin Jia June 15, 2016 @ Alluxio Meetup (hosted by Intel) Related Blog Post: http://goo.gl/MUpL0O

Getting Started with Alluxio + Spark + S3

Embed Size (px)

Citation preview

Alluxio (formerly Tachyon):Getting Started with Alluxio + Spark + S3

Calvin Jia

June 15, 2016 @ Alluxio Meetup (hosted by Intel)

Related Blog Post: http://goo.gl/MUpL0O

Who Am I?

• Calvin Jia

• SWE @ Alluxio, Inc.

• Alluxio PMC Member

• Twitter: @JiaCalvin

2

Outline

• Technology Overview

• Alluxio + Spark + S3

• Demo

3

Alluxio Ecosystem

4

Why Alluxio?

• Data sharing between jobs

• Data resilience during application crashes

• Consolidate memory usage and alleviate GC issues

5

In-Memory Storage

block 1

block 3

In-Memory Storage

block 1

block 3

block 2

block 4

storage engine & execution enginesame process

Data Sharing Between Jobs

Inter-process sharing slowed down by network I/O6

Data Sharing Between Jobs

block 1

block 3

block 2

block 4

HDFSdisk

block 1

block 3

block 2

block 4 In-Memory

block 1

block 3 block 4

storage & execution engineseparated

Inter-process sharing can happen at memory speed7

Data Resilience during Crashes

In-Memory Storageblock 1

block 3

block 1

block 3

block 2

block 4

storage engine & execution enginesame process

Process crash requires network I/O to re-read the data

8

Data Resilience during Crashes

Crash

In-Memory Storageblock 1

block 3

block 1

block 3

block 2

block 4

storage engine & execution enginesame process

Process crash requires network I/O to re-read the data

9

Data Resilience during Crashes

block 1

block 3

block 2

block 4

Crash

storage engine & execution enginesame process

Process crash requires network I/O to re-read the data

10

Data Resilience during Crashes

storage & execution engineseparated

HDFSdisk

block 1

block 3

block 2

block 4 In-Memory

block 1

block 3 block 4

Process crash only needs memory I/O to re-read the data

11

Data Resilience during Crashes

Crashstorage & execution engineseparated

Process crash only needs memory I/O to re-read the data

HDFSdisk

block 1

block 3

block 2

block 4 In-Memory

block 1

block 3 block 4

12

Consolidating Memory

In-MemoryStorage

block 1

block 3

In-MemoryStorage

block 3

block 1

block 1

block 3

block 2

block 4

storage engine & execution enginesame process

Data duplicated at memory-level

13

Consolidating Memory

block 1

block 3

block 2

block 4

storage & execution engineseparated

HDFSdisk

block 1

block 3

block 2

block 4 In-Memory

block 1

block 3 block 4

Data not duplicated at memory-level

14

Outline

• Technology Overview

• Alluxio + Spark + S3

• Demo

15

Visualizing the Stack

16

FAST 104 - 105 MB/s

MODERATE 103 - 104 MB/s

SLOW 102 - 103 MB/s

Only when necessaryLimited

Often

SSDHDD

Mem

When to use Alluxio

•Two or more jobs access the same dataset•Job(s) may not always succeed•Dataset larger than Spark JVM•Jobs are pipelined•Resulting data does not need to be immediately persisted

17

Version Selection

• Alluxio 1.1.0–Latest released version–Many improvements, upgrade recommended

• Spark 1.6.1–Latest released version–Remember to use Spark Alluxio client, ie. -Pspark

–Spark 2.0 is coming out soon, will recommend the best way to integrate with Alluxio

18

API Selection• Access data directly through the FileSystem API, but

change scheme to alluxio://–Minimal code change–Do not need to reason about logic

•Example:–val file = sc.textFile(“s3n://my-bucket/myFile”)–val file = sc.textFile(“alluxio://master:19998/myFile”)

19

Outline

• Technology Overview

• Alluxio + Spark + S3

• Demo

20