Alluxio Presentation at Strata San Jose 2016

Alluxio (formerly Tachyon): Unified Namespace and Tiered Storage

Calvin Jia, Jiri Simsa

One of the Things to Watch at Strata

TechCrunch article:

“… An interesting item that made the top terms list is “alluxio,” which is the recently renamed Tachyon project. Alluxio is a virtual distributed storage system, and it has a memory-centric architecture that enables data sharing across clusters at memory speed. …”

Who Are We?

• Calvin Jia

• SWE @ Alluxio, Inc.

• #1 Alluxio contributor

• Twitter: @JiaCalvin

• Jiri Simsa

• SWE @ Alluxio, Inc.

• CMU Ph.D. & Google

• Twitter: @jsimsa

Alluxio Inc.

• Founded by Alluxio creators and top committers

• Formerly Tachyon Nexus, Inc.

• $7.5 million Series A by Andreessen Horowitz

• Committed to the Alluxio Open Source Project

• Company Website: http://www.alluxio.com

• We are hiring!

Outline

• Alluxio Introduction

• Tiered Storage

• Unified Namespace

ALLUXIO:

Open Source Memory Speed

Virtual Distributed Storage

Memory Speed

• Memory-centric architecture designed for memory I/O

Virtual

• Abstracts persistent storage from applications

Distributed

• Designed to scale with nothing but commodity hardware

Open Source

• One of the fastest growing project communities

Contributor Growth

• Over 200 Contributors

– 3x growth over the last year

Organizations

• Over 50 Organizations

Alluxio Ecosystem

Memory is Getting Faster

Memory is Getting Cheaper

Simple Examples

• Data sharing between frameworks

• Data resilience during application crashes

• Consolidate memory usage and alleviate GC issues

Data Sharing Between Frameworks

[Diagram: a Spark job keeps blocks in Spark memory, with the storage engine and execution engine in the same process; a Hadoop MR job runs separately; blocks 1 to 4 reside in HDFS / Amazon S3]

Inter-process sharing slowed down by network and/or disk I/O

Data Sharing Between Frameworks

[Diagram: a Spark job (with its own Spark memory) and a Hadoop MR job both access blocks held in Alluxio in-memory storage; blocks 1 to 4 reside on HDFS / disk]

Inter-process sharing can happen at memory speed

Data Resilience during Crashes

[Diagram: a Spark task whose block manager holds blocks 1 to 4 in Spark memory; the same blocks reside in HDFS / Amazon S3]

Process crash requires network and/or disk I/O to re-read the data

[Diagram: a Spark task reads blocks held in Alluxio in-memory storage; blocks 1 to 4 reside on HDFS / disk (HDFS / Amazon S3)]

Process crash only needs memory I/O to re-read the data

Consolidating Memory

[Diagram: Spark Job1 and Spark Job2 each hold their own copies of blocks in Spark memory; blocks 1 to 4 reside in the underlying storage]

Data duplicated at memory-level

Consolidating Memory

[Diagram: Spark Job1 and Spark Job2 (each with its own Spark memory) share blocks held in Alluxio in-memory storage; blocks 1 to 4 reside on HDFS / disk]

Data not duplicated at memory-level

Case Study: Barclays

Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds

• Application: SparkSQL + Spark RDDs

• Alluxio Storage Layer: MEM

• Backend Storage: None

• Result: Speeding up Spark jobs from hours to seconds

Common Questions

– Memory speed sharing among distributed applications

HDFS interface compatible

– GC overhead introduced by in-memory caching

Off-Heap Memory Management

– Data set could be larger than available memory

Tiered storage

Outline

• Alluxio Introduction

• Tiered Storage

Motivation

• Memory resources are still constrained

• Alluxio data management logic is not limited to memory

• Storage resources available on compute clusters

Tiered Storage

Tiered Storage

• Extends Alluxio with support for SSD and/or HDD storage

• Different tiers have different characteristics

– Keep hot data in fast but limited storage

– Keep warm data in slower but abundant storage

• Workers manage their own storage

• Data allocation and eviction are driven by application access

Tiered Storage Architecture

[Diagram: Machine Type 1 runs a Compute Client and the Alluxio Master; Machine Type 2 runs a Compute Client and an Alluxio Worker; each machine has Memory, SSD, HDD]

Tiered Storage Architecture

Machine Type 2

Compute Client

• Alluxio Client

Alluxio Worker

• Tiered Block Store

• Evictor

• Allocator

Memory, SSD, HDD

Automatic Data Migration

• Data can be evicted to lower layers if it is “cooling down”

• Data can be promoted to upper layers if it is “warming up”

Evict stale data to lower tier

Promote hot data to upper tier
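The migration behavior described on this slide can be illustrated with a small sketch. This is not Alluxio's implementation: the `Tier` and `TieredBlockStore` classes, the block-count capacities, and the choice of LRU ordering with promote-on-read are all assumptions made for illustration.

```python
from collections import OrderedDict


class Tier:
    """One storage tier (e.g. MEM, SSD, HDD) with a fixed block capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, least recently used first


class TieredBlockStore:
    """Toy block store (hypothetical): tiers[0] is fastest, tiers[-1] slowest."""

    def __init__(self, tiers):
        self.tiers = tiers

    def _remove(self, block_id):
        # Take the block out of whichever tier currently holds it.
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.blocks.pop(block_id)
        return None

    def _insert(self, level, block_id, data):
        tier = self.tiers[level]
        tier.blocks[block_id] = data  # appended as most recently used
        while len(tier.blocks) > tier.capacity:
            victim_id, victim = tier.blocks.popitem(last=False)
            if level + 1 < len(self.tiers):
                # "Cooling down": evict stale data to the next lower tier.
                self._insert(level + 1, victim_id, victim)
            # With no lower tier left, the block simply leaves the store.

    def write(self, block_id, data):
        self._remove(block_id)
        self._insert(0, block_id, data)

    def read(self, block_id):
        data = self._remove(block_id)
        if data is not None:
            # "Warming up": promote the block to the top tier on access.
            self._insert(0, block_id, data)
        return data

    def location(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return None


store = TieredBlockStore([Tier("MEM", 2), Tier("HDD", 4)])
for block in ("block1", "block2", "block3"):
    store.write(block, b"payload")
print(store.location("block1"))  # block1 was pushed out of MEM by later writes
store.read("block1")
print(store.location("block1"))  # reading it promoted it back to the top tier
```

In this sketch a read always promotes; a real system would more likely make that decision configurable, which is what the pluggable policies on the next slide are about.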

Pluggable Policies

• Policies can be customized to suit workloads

• Defaults provided for general scenarios

• Advanced users can optimize with additional knowledge

– For example: Optimize for iterations
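A minimal sketch of what a pluggable eviction policy could look like follows. The `Evictor` interface, the two policy classes, and the `count_hits` helper are hypothetical names chosen for this illustration and do not reflect Alluxio's actual API. The MRU policy stands in for "optimize for iterations": when a job repeatedly scans a data set slightly larger than the tier, evicting the most recently used block tends to keep more of the next pass cached than LRU does.

```python
from abc import ABC, abstractmethod


class Evictor(ABC):
    """Hypothetical policy interface: decides which cached block to drop."""

    @abstractmethod
    def on_access(self, block_id):
        """Record that a block was read or written."""

    @abstractmethod
    def pick_victim(self, candidates):
        """Return the block id to evict from the given cached set."""


class LRUEvictor(Evictor):
    """General-purpose default: evict the least recently used block."""

    def __init__(self):
        self._clock = 0
        self._last_access = {}

    def on_access(self, block_id):
        self._clock += 1
        self._last_access[block_id] = self._clock

    def pick_victim(self, candidates):
        return min(candidates, key=lambda b: self._last_access.get(b, 0))


class MRUEvictor(LRUEvictor):
    """Workload-specific policy: evict the most recently used block."""

    def pick_victim(self, candidates):
        return max(candidates, key=lambda b: self._last_access.get(b, 0))


def count_hits(evictor, capacity, accesses):
    """Replay an access sequence against a cache of `capacity` blocks."""
    cached = set()
    hits = 0
    for block_id in accesses:
        if block_id in cached:
            hits += 1
        else:
            if len(cached) >= capacity:
                cached.remove(evictor.pick_victim(cached))
            cached.add(block_id)
        evictor.on_access(block_id)
    return hits


# Three passes over three blocks with room for only two of them.
iterative_scan = [1, 2, 3] * 3
print("LRU hits:", count_hits(LRUEvictor(), 2, iterative_scan))
print("MRU hits:", count_hits(MRUEvictor(), 2, iterative_scan))
```

Because the cache loop only depends on the `Evictor` interface, swapping policies requires no change to the storage logic, which is the point of making policies pluggable.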

Case Study: Baidu

Baidu Queries Data 30 Times Faster with Alluxio

• Application: Spark

• Alluxio Storage: MEM + HDD

• Backend Storage: Baidu’s File System

• 200+ nodes deployment, 2PB+ managed space

• Result: Speeding up data querying by 30x

Outline

• About Alluxio

• Tiered Storage

Big Data Ecosystem

Motivation

• At large organizations, data spans many storage systems (object storage, network / distributed file systems, DBs)

• Application logic needs to integrate with different types of storage systems

• Data needs to be moved around to work around application limitations

• In-house storage layers are built to address limitations of legacy storage systems

Transparent Naming

• Applications can transparently and efficiently interact with remote storage through Alluxio.

• Applications do not need to use different APIs for interacting with different storage systems.

[Diagram: the Alluxio namespace at alluxio://host:port/ (data, users, reports, sales, alice, bob) shown alongside the storage system at s3n://bucket/directory (data, users)]

Single Namespace

• Applications can read and write different storage systems.

• Decouples data location from application

[Diagram: the Alluxio namespace at alluxio://host:port/ (data, users) shown alongside Storage System A at hdfs://host:port/ (alice, bob) and Storage System B at s3n://bucket/directory (reports, sales)]
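One way to picture a single namespace over several storage systems is a mount table that maps Alluxio paths to locations in the underlying stores. The sketch below is an illustration only: the `MountTable` class and its methods are hypothetical, and the assignment of `/users` to the HDFS system and `/data` to the S3 bucket is an assumption read off the slide's diagram, not a documented configuration.

```python
class MountTable:
    """Hypothetical mapping from Alluxio paths to under-storage URIs."""

    def __init__(self):
        self._mounts = {}  # Alluxio mount point -> under-storage base URI

    def mount(self, alluxio_path, ufs_uri):
        # Normalize so that "/data/" and "/data" name the same mount point.
        self._mounts[alluxio_path.rstrip("/") or "/"] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate an Alluxio path using the longest matching mount point."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


table = MountTable()
table.mount("/data", "s3n://bucket/directory")  # assumed mapping
table.mount("/users", "hdfs://host:port/")      # assumed mapping

# The application only ever sees Alluxio paths.
print(table.resolve("/data/reports"))
print(table.resolve("/users/alice"))
```

The application code stays the same if `/data` is later remounted onto a different storage system, which is the sense in which data location is decoupled from the application.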

Architecture

[Diagram: ALLUXIO sits between the Alluxio Interface and the UFS Interface; an HDFS adapter, an S3 adapter, and a Swift adapter connect it to HDFS, S3, Swift, …]

Alluxio Benefits

• Enable new workloads across storage systems

• Work with the framework of your choice

• Scale storage and compute independently

Resources

• Alluxio Project: http://www.alluxio.org

• Development: https://github.com/Alluxio/alluxio

• Meet Friends: http://www.meetup.com/Alluxio

• Alluxio Inc: http://www.alluxio.com

• Contact us: info@alluxio.com
