Alluxio Presentation at Strata San Jose 2016

Alluxio (formerly Tachyon): Unified Namespace and Tiered Storage

Calvin Jia, Jiri Simsa

2

One of the Things to Watch at Strata

TechCrunch article:

“… An interesting item that made the top terms list is ‘alluxio,’ which is the recently renamed Tachyon project. Alluxio is a virtual distributed storage system, and it has a memory-centric architecture that enables data sharing across clusters at memory speed. …”

3

Who Are We?

• Calvin Jia

• SWE @ Alluxio, Inc.

• #1 Alluxio contributor

• Twitter: @JiaCalvin

• Jiri Simsa

• SWE @ Alluxio, Inc.

• CMU Ph.D. & Google

• Twitter: @jsimsa

4

Alluxio Inc.

• Founded by Alluxio creators and top committers

• Formerly Tachyon Nexus, Inc.

• $7.5 million Series A by Andreessen Horowitz

• Committed to the Alluxio Open Source Project

• Company Website: http://www.alluxio.com

• We are hiring!

5

Outline

• Alluxio Introduction

• Tiered Storage

• Unified Namespace

6

ALLUXIO:

Open Source Memory Speed

Virtual Distributed Storage

7

Memory Speed

• Memory-centric architecture designed for memory I/O

Virtual

• Abstracts persistent storage from applications

Distributed

• Designed to scale with nothing but commodity hardware

Open Source

• One of the fastest growing project communities

8

Contributor Growth

• Over 200 Contributors

– 3x growth over the last year

9

Organizations

• Over 50 Organizations

10

Alluxio Ecosystem

11

Memory is Getting Faster

12

Memory is Getting Cheaper

13

Simple Examples

• Data sharing between frameworks

• Data resilience during application crashes

• Consolidate memory usage and alleviate GC issues

14

Data Sharing Between Frameworks

[Diagram: a Spark job (storage engine & execution engine in the same process) holds blocks 1 and 3 in Spark memory; a Hadoop MR job runs on YARN; HDFS / Amazon S3 holds blocks 1, 2, 3, and 4.]

Inter-process sharing slowed down by network and/or disk I/O

15

Data Sharing Between Frameworks

[Diagram: a Spark job (storage engine & execution engine in the same process) and a Hadoop MR job on YARN; Alluxio holds blocks 1, 3, and 4 in memory; HDFS / Amazon S3 holds blocks 1, 2, 3, and 4 on disk.]

Inter-process sharing can happen at memory speed

16

Data Resilience during Crashes

[Diagram: a Spark task (storage engine & execution engine in the same process) holds blocks 1 and 3 in Spark memory through the block manager; HDFS / Amazon S3 holds blocks 1, 2, 3, and 4.]

Process crash requires network and/or disk I/O to re-read the data

17

Data Resilience during Crashes

[Diagram: the Spark task crashes while blocks 1 and 3 are held in Spark memory by the block manager (storage engine & execution engine in the same process); HDFS / Amazon S3 holds blocks 1, 2, 3, and 4.]

Process crash requires network and/or disk I/O to re-read the data

18

Data Resilience during Crashes

[Diagram: after the crash, the blocks held in Spark memory are gone (storage engine & execution engine in the same process); only HDFS / Amazon S3 still holds blocks 1, 2, 3, and 4.]

Process crash requires network and/or disk I/O to re-read the data

19

Data Resilience during Crashes

[Diagram: a Spark task with Spark memory and block manager (storage engine & execution engine in the same process); Alluxio holds blocks 1, 3, and 4 in memory; HDFS holds blocks 1, 2, 3, and 4 on disk.]

Process crash only needs memory I/O to re-read the data

20

Data Resilience during Crashes

[Diagram: the Spark task crashes (storage engine & execution engine in the same process); Alluxio still holds blocks 1, 3, and 4 in memory; HDFS holds blocks 1, 2, 3, and 4 on disk.]

Process crash only needs memory I/O to re-read the data

21

Consolidating Memory

[Diagram: Spark Job1 and Spark Job2 (storage engine & execution engine in the same process) each hold their own copies of blocks 1 and 3 in Spark memory; HDFS / Amazon S3 holds blocks 1, 2, 3, and 4.]

Data duplicated at memory-level

22

Consolidating Memory

[Diagram: Spark Job1 and Spark Job2, each with its own Spark memory (storage engine & execution engine in the same process); Alluxio holds a single in-memory copy of blocks 1, 3, and 4; HDFS / Amazon S3 holds blocks 1, 2, 3, and 4 on disk.]

Data not duplicated at memory-level

23

Case Study: Barclays

Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

• Application: SparkSQL + Spark RDDs

• Alluxio Storage Layer: MEM

• Backend Storage: None

• Result: Speeding up Spark jobs from hours to seconds

24

Common Questions

– Memory speed sharing among distributed applications

HDFS interface compatible

– GC overhead introduced by in-memory caching

Off-Heap Memory Management

– Data set could be larger than available memory

Tiered storage

25

Outline

• Alluxio Introduction

• Tiered Storage

• Unified Namespace

26

Motivation

• Memory resources are still constrained

• Alluxio data management logic is not limited to memory

• Storage resources available on compute clusters

27

Tiered Storage

MEM

SSD

HDD

28

Tiered Storage

• Extends Alluxio with support for SSD and/or HDD storage

• Different tiers have different characteristics

– Keep hot data in fast but limited storage

– Keep warm data in slower but abundant storage

• Workers manage their own storage

• Data allocation and eviction are driven by application access

29

Tiered Storage Architecture

Machine Type 1

Compute Client

Alluxio Master

Memory, SSD, HDD

Machine Type 2

Compute Client

Alluxio Worker

Memory, SSD, HDD

30

Tiered Storage Architecture

Machine Type 2

Compute Client

• Alluxio Client

Alluxio Worker

• Tiered Block Store

• Evictor

• Allocator

Memory, SSD, HDD

31

Automatic Data Migration

• Data can be evicted to lower layers if it is “cooling down”

• Data can be promoted to upper layers if it is “warming up”

Evict stale data to lower tier

Promote hot data to upper tier
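The eviction and promotion behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Alluxio's implementation: the class name `TieredBlockStore`, its methods, the per-tier capacities counted in blocks, and the choice of least-recently-used eviction with promote-on-read are all hypothetical and chosen for illustration.

```python
from collections import OrderedDict


class TieredBlockStore:
    """Toy model of one worker's tiered block store (hypothetical, not Alluxio code)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def location(self, block_id):
        # Return the name of the tier holding the block, or None.
        for name, _, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None

    def _remove(self, block_id):
        # Drop the block from whichever tier holds it and return its data.
        for _, _, blocks in self.tiers:
            if block_id in blocks:
                return blocks.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        _, cap, blocks = self.tiers[level]
        blocks[block_id] = data  # most recently used entries sit at the end
        if len(blocks) > cap:
            # "Cooling down": push the least recently used block one tier lower.
            victim_id, victim_data = blocks.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim_data)
            # A block evicted from the lowest tier leaves this toy store entirely.

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        if self.location(block_id) is None:
            return None
        data = self._remove(block_id)
        # "Warming up": an accessed block is promoted to the top tier.
        self._insert(0, block_id, data)
        return data


store = TieredBlockStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for i in (1, 2, 3):
    store.write(f"block {i}", b"x")
print(store.location("block 1"))  # SSD: evicted when block 3 arrived
store.read("block 1")
print(store.location("block 1"))  # MEM: promoted on access
print(store.location("block 2"))  # SSD: evicted to make room
```

Counting capacity in blocks rather than bytes keeps the sketch short; a real store would track sizes and might promote less eagerly.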

32

Pluggable Policies

• Policies can be customized to suit workloads

• Defaults provided for general scenarios

• Advanced users can optimize with additional knowledge

– For example: Optimize for iterations
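One way to picture a pluggable policy is as a small interface with a general-purpose default and a workload-aware override. The sketch below is an assumption-laden illustration, not Alluxio's actual policy API: the method names `on_access` and `pick_victim`, and the `IterationAwareEvictor` class that protects blocks an iterative job will reuse, are hypothetical.

```python
from abc import ABC, abstractmethod


class Evictor(ABC):
    """Hypothetical policy interface: decides which block to evict."""

    @abstractmethod
    def on_access(self, block_id):
        """Record that a block was read or written."""

    @abstractmethod
    def pick_victim(self, candidates):
        """Return the block id to evict from a non-empty candidate list."""


class LRUEvictor(Evictor):
    """General-purpose default: evict the least recently used block."""

    def __init__(self):
        self._clock = 0
        self._last_access = {}

    def on_access(self, block_id):
        self._clock += 1
        self._last_access[block_id] = self._clock

    def pick_victim(self, candidates):
        # Blocks never accessed count as oldest (timestamp 0).
        return min(candidates, key=lambda b: self._last_access.get(b, 0))


class IterationAwareEvictor(LRUEvictor):
    """Workload-aware policy: spare blocks the next iteration will reuse."""

    def __init__(self, reused_blocks):
        super().__init__()
        self._reused = set(reused_blocks)

    def pick_victim(self, candidates):
        unprotected = [b for b in candidates if b not in self._reused]
        # Fall back to plain LRU if every candidate is protected.
        return super().pick_victim(unprotected or list(candidates))


default = LRUEvictor()
tuned = IterationAwareEvictor(reused_blocks={"b"})
for policy in (default, tuned):
    for block in ("a", "b", "c", "a"):
        policy.on_access(block)
print(default.pick_victim(["a", "b", "c"]))  # b: least recently used
print(tuned.pick_victim(["a", "b", "c"]))    # c: b is protected for reuse
```

The point of the design is that the storage layer only calls the interface, so swapping the default for a policy with extra workload knowledge needs no change to the store itself.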

33

Case Study: Baidu

Baidu Queries Data 30 Times Faster with Alluxio

• Application: Spark

• Alluxio Storage: MEM + HDD

• Backend Storage: Baidu’s File System

• 200+ node deployment, 2PB+ managed space

• Result: Speeding up data querying by 30x

34

Outline

• Alluxio Introduction

• Tiered Storage

• Unified Namespace

35

Big Data Ecosystem

36

Big Data Ecosystem

37

Big Data Ecosystem

38

Motivation

• At large organizations, data spans many storage systems (object storage, network / distributed file systems, DBs)

• Application logic needs to integrate with different types of storage systems

• Data needs to be moved around to work around application limitations

• In-house storage layers are built to address limitations of legacy storage systems

39

Transparent Naming

• Applications can transparently and efficiently interact with remote storage through Alluxio.

• Applications do not need to use different APIs for interacting with different storage systems.

[Diagram: the Alluxio namespace alluxio://host:port/ contains data (reports, sales) and users (alice, bob), mirroring the same tree in the storage system at s3n://bucket/directory.]

40

Single Namespace

• Applications can read from and write to different storage systems.

• Decouples data location from application

[Diagram: the Alluxio namespace alluxio://host:port/ contains data (reports, sales) and users (alice, bob). Storage System A, hdfs://host:port/, holds users (alice, bob). Storage System B, s3n://bucket/directory, holds reports and sales.]
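The single-namespace idea can be sketched as a mount table that maps a path in the Alluxio namespace to a location in an underlying storage system. The `MountTable` class, its method names, and the longest-prefix resolution rule below are assumptions made for illustration, not Alluxio's API; the URIs are the placeholders from the slide.

```python
class MountTable:
    """Toy mount table (hypothetical): namespace path -> under-storage URI."""

    def __init__(self):
        self._mounts = {}

    @staticmethod
    def _normalize(path):
        # "/data/", "data" and "/data" all become "/data"; the root stays "/".
        return "/" + path.strip("/")

    def mount(self, namespace_path, ufs_uri):
        self._mounts[self._normalize(namespace_path)] = ufs_uri.rstrip("/")

    def resolve(self, namespace_path):
        path = self._normalize(namespace_path)
        best = None
        for mount_point in self._mounts:
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the most specific (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].strip("/")
        ufs_root = self._mounts[best]
        return f"{ufs_root}/{relative}" if relative else ufs_root


table = MountTable()
table.mount("/users", "hdfs://host:port/users")
table.mount("/data", "s3n://bucket/directory")
print(table.resolve("/users/alice"))   # hdfs://host:port/users/alice
print(table.resolve("/data/reports"))  # s3n://bucket/directory/reports
```

An application only ever sees `/users/alice` or `/data/reports`; which storage system holds the bytes is a detail of the table, so data can move without changing application code.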

41

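Architecture

[Diagram: ALLUXIO exposes the Alluxio Interface to applications and uses the UFS Interface underneath. An HDFS adapter, an S3 adapter, and a Swift adapter connect the UFS Interface to HDFS, S3, Swift, …]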
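The layering in this architecture, one application-facing interface on top and one adapter per storage system underneath, can be sketched as follows. All names here (`UnderStorageAdapter`, `DictBackedAdapter`, `StorageFrontend`) are hypothetical, and the adapters keep data in a dictionary instead of talking to HDFS, S3, or Swift; the sketch shows only the dispatch structure, not real connectors.

```python
from abc import ABC, abstractmethod


class UnderStorageAdapter(ABC):
    """Hypothetical under-storage interface that every backend implements."""

    @abstractmethod
    def read(self, path):
        """Return the bytes stored at path."""

    @abstractmethod
    def write(self, path, data):
        """Store bytes at path."""


class DictBackedAdapter(UnderStorageAdapter):
    """Stand-in for a real HDFS, S3, or Swift adapter; stores bytes in a dict."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def read(self, path):
        return self.objects[path]

    def write(self, path, data):
        self.objects[path] = data


class StorageFrontend:
    """Single application-facing interface that dispatches on URI scheme."""

    def __init__(self):
        self._adapters = {}

    def register(self, scheme, adapter):
        self._adapters[scheme] = adapter

    def _route(self, uri):
        scheme, separator, path = uri.partition("://")
        if not separator or scheme not in self._adapters:
            raise ValueError(f"no adapter registered for {uri!r}")
        return self._adapters[scheme], path

    def write(self, uri, data):
        adapter, path = self._route(uri)
        adapter.write(path, data)

    def read(self, uri):
        adapter, path = self._route(uri)
        return adapter.read(path)


frontend = StorageFrontend()
frontend.register("hdfs", DictBackedAdapter("HDFS"))
frontend.register("s3n", DictBackedAdapter("S3"))
frontend.write("s3n://bucket/directory/reports", b"q1 report")
print(frontend.read("s3n://bucket/directory/reports"))  # b'q1 report'
```

Supporting another storage system in this model means writing one more adapter and registering it; callers of `StorageFrontend` are unaffected.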

42

Alluxio Benefits

• Enable new workloads across storage systems

• Work with the framework of your choice

• Scale storage and compute independently

43

Resources

• Alluxio Project: http://www.alluxio.org

• Development: https://github.com/Alluxio/alluxio

• Meet Friends: http://www.meetup.com/Alluxio

• Alluxio Inc: http://www.alluxio.com

• Contact us: info@alluxio.com
