The Architecture of Decoupling Compute and Storage with Alluxio

Haoyuan Li, Alluxio Inc.

December 2017 @ Strata Singapore


Confidential © Alluxio, Inc. All Rights Reserved. 2

About Me

•  Haoyuan (H.Y.) Li

•  Founder and CEO at Alluxio, Inc.

•  Created Alluxio (originally named Tachyon) at UC Berkeley AMPLab as a Ph.D. candidate.

•  Google, Cornell University, Peking University


Decoupling Compute From Storage

Benefits –

•  Different compute and storage hardware requirements

•  Scale compute and storage resources independently

•  Traditional filers/SANs and cost-effective object stores (Amazon S3, Google GCS, Microsoft Azure Blob Store) are inherently decoupled

•  Fast-evolving big data ecosystem

Challenges –

•  Accessing data requires remote I/O


Remote I/O

[Diagram: Spark reading from Amazon S3. High latency, network throughput.]

Every data operation requires data transfer, sometimes over the WAN


Data Operations with Alluxio

[Diagram: Spark, Alluxio, Amazon S3. Spark to Alluxio: low latency, memory throughput. Alluxio to Amazon S3: high latency, network throughput.]

Keeping data in Alluxio accelerates data access
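The effect described on this slide can be sketched in a few lines of Python. This is a toy model, not Alluxio code: RemoteStore, ReadThroughCache, the object key and the delay value are all hypothetical and chosen only for illustration. It shows the idea that the first read pays for remote I/O while repeated reads are served from local memory.

```python
import time


class RemoteStore:
    """Stand-in for a remote object store; every read pays a simulated network delay."""

    def __init__(self, objects, delay_s=0.01):
        self._objects = dict(objects)
        self._delay_s = delay_s
        self.reads = 0  # number of remote fetches actually performed

    def read(self, key):
        time.sleep(self._delay_s)  # simulated remote I/O latency (assumed value)
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Serves repeated reads from memory and goes to the remote store only on a miss."""

    def __init__(self, remote):
        self._remote = remote
        self._memory = {}

    def read(self, key):
        if key not in self._memory:
            # Cold read: data has to cross the network once.
            self._memory[key] = self._remote.read(key)
        # Warm read: no remote I/O involved.
        return self._memory[key]


# Hypothetical object name and contents, for illustration only.
remote = RemoteStore({"sales/2017.csv": b"date,amount\n"})
cache = ReadThroughCache(remote)
first = cache.read("sales/2017.csv")   # fetched from the remote store
second = cache.read("sales/2017.csv")  # served from memory
```

After both reads, remote.reads is still 1: the second read never touched the remote store.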


Data Ecosystem Develops

• One Compute Framework

• Single Storage System

• Co-located ETL


Data Ecosystem Explodes

•  Many Compute Frameworks

•  Many Storage Systems

•  Most not co-located


Data Ecosystem Issues

•  Each app manages multiple data sources

•  Data source changes require global updates

•  Storage optimizations require app changes

•  Poor performance due to lack of locality


Data Ecosystem with Alluxio

•  Apps only talk to Alluxio

•  Simple Add/Remove

•  No App Changes

•  Highest performance in Memory

•  No lock-in

Application-facing interfaces: Native File System, Hadoop Compatible File System, Amazon S3 Interface, REST Web Service

Storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface


Use Case – Multi-Cluster Federated Query


History

•  Started at UC Berkeley AMPLab in summer 2012

•  Originally named Tachyon

•  Rebranded to Alluxio in early 2016

•  Open sourced in 2013

•  Apache License 2.0

•  Latest stable release: Alluxio 1.6.1 (Nov 2017)


Fastest Growing Big Data Open Source Project

Fastest-growing open-source project in the big data ecosystem

Running the world’s largest production clusters

600+ Contributors from 100+ organizations

[Chart: Open Source Contributors by Month (GitHub). Y-axis: number of contributors. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]

What’s Alluxio

Memory-Speed Virtual Distributed Storage

Non-persistent data-storage software.

Scale-out architecture.

Virtualized across different storage types under a unified namespace.

Memory-speed access to data.


Alluxio Innovation: Unified Namespace

Enables effective data management across different Under Stores

Uses Mounting with Transparent Naming
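One way to picture mounting with transparent naming is a table that maps logical paths onto locations in different under stores. The sketch below is an illustration only and does not reflect Alluxio's internal implementation: MountTable, its methods, and the example URIs (hdfs://namenode:9000/data, s3a://finance-bucket) are hypothetical. It assumes the longest matching mount point wins.

```python
class MountTable:
    """Maps paths in one logical namespace onto URIs in different under stores."""

    def __init__(self):
        self._mounts = {}  # logical mount point -> under-store URI prefix

    def mount(self, logical_path, under_store_uri):
        point = logical_path.rstrip("/") or "/"
        self._mounts[point] = under_store_uri.rstrip("/")

    def resolve(self, logical_path):
        """Return the under-store URI for a logical path (longest mount point wins)."""
        best = None
        for point in self._mounts:
            prefix = point if point.endswith("/") else point + "/"
            if logical_path == point or logical_path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            raise KeyError(f"no mount covers {logical_path}")
        suffix = logical_path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{suffix}" if suffix else base


# Hypothetical under stores, for illustration only.
table = MountTable()
table.mount("/", "hdfs://namenode:9000/data")
table.mount("/finance", "s3a://finance-bucket/")

finance_uri = table.resolve("/finance/customer-transactions/part-0")
logs_uri = table.resolve("/operations/device-logs")
```

The application only ever sees the logical paths; which storage system actually holds the bytes is decided by the table.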


Alluxio Innovation: Unified Namespace

Create a catalog of available data sources for Data Scientists

alluxio://

/finance/customer-transactions/

/finance/vendor-transactions/

/operations/device-logs/

/operations/phone-call-recordings/

/operations/check-images/

/research/us-economic-data/

/research/intl-economic-data/

/marketing/advertising-dataset/

/marketing/marketing-funnel-dataset/


Alluxio Innovation: Server-side API Translation

Convert from Client-side Interface to native Storage Interface

Client-side interface: HDFS Interface

Native storage interfaces: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
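The translation idea resembles an adapter: clients call one interface, and the server forwards each call to whatever native interface the under store speaks. The following is a minimal sketch under that assumption, not Alluxio's actual code. UnderStore, FlatObjectStore, HierarchicalStore and TranslatingServer are hypothetical names, and both stores are in-memory stand-ins.

```python
from abc import ABC, abstractmethod


class UnderStore(ABC):
    """Native storage interface that client calls are translated into."""

    @abstractmethod
    def put(self, path, data): ...

    @abstractmethod
    def get(self, path): ...


class FlatObjectStore(UnderStore):
    """Stand-in for an object store: flat keys, no real directories."""

    def __init__(self):
        self.objects = {}

    def put(self, path, data):
        self.objects[path.lstrip("/")] = data

    def get(self, path):
        return self.objects[path.lstrip("/")]


class HierarchicalStore(UnderStore):
    """Stand-in for a file system: nested directories holding files."""

    def __init__(self):
        self.root = {}

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})  # create directories on the way down
        node[name] = data

    def get(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node[part]
        return node


class TranslatingServer:
    """Exposes one file-style interface and forwards calls to the configured under store."""

    def __init__(self, under_store):
        self._under = under_store

    def create_file(self, path, data):
        self._under.put(path, data)

    def open_file(self, path):
        return self._under.get(path)


# The client code is identical for both stores; only the server's backend differs.
object_backend = FlatObjectStore()
file_backend = HierarchicalStore()
for backend in (object_backend, file_backend):
    server = TranslatingServer(backend)
    server.create_file("/logs/a.txt", b"hello")
    assert server.open_file("/logs/a.txt") == b"hello"
```

Because the translation happens on the server side in this sketch, swapping the backend requires no change to the client calls.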


Alluxio Innovation: Intelligent Cache

Local performance from remote data using multi-tier storage

Tiers: RAM (hot), SSD (warm), HDD (cold)

Read & Write Buffering Transparent to App

Policies for pinning, promotion/demotion, TTL
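How pinning, promotion/demotion and TTL could interact is easier to see in a toy model. The sketch below is a simplification under stated assumptions, not Alluxio's cache: TieredCache and its parameters are hypothetical, capacities are counted in blocks, the demotion victim is the least recently used unpinned block, and time is passed in explicitly so the behaviour is deterministic.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier cache: blocks enter the fastest tier and are demoted when it fills."""

    def __init__(self, capacities):
        # capacities: tier name -> max number of blocks, fastest tier first
        self._names = list(capacities)
        self._caps = dict(capacities)
        self._tiers = {name: OrderedDict() for name in self._names}
        self._pinned = set()
        self._expires = {}  # key -> absolute expiry time

    def put(self, key, value, now=0.0, ttl=None, pinned=False):
        self._remove(key)
        if pinned:
            self._pinned.add(key)
        if ttl is not None:
            self._expires[key] = now + ttl
        self._insert(0, key, value)

    def get(self, key, now=0.0):
        if key in self._expires and now >= self._expires[key]:
            self._remove(key)  # TTL elapsed: drop the block
            return None
        for name in self._names:
            if key in self._tiers[name]:
                value = self._tiers[name].pop(key)
                self._insert(0, key, value)  # promote to the fastest tier on access
                return value
        return None

    def tier_of(self, key):
        for name in self._names:
            if key in self._tiers[name]:
                return name
        return None

    def _insert(self, level, key, value):
        if level >= len(self._names):
            # Fell off the slowest tier: evicted from the cache entirely.
            self._pinned.discard(key)
            self._expires.pop(key, None)
            return
        name = self._names[level]
        tier = self._tiers[name]
        tier[key] = value
        if len(tier) > self._caps[name]:
            # Demote the least recently used unpinned block (or the newcomer if all are pinned).
            victim = next((k for k in tier if k not in self._pinned), key)
            self._insert(level + 1, victim, tier.pop(victim))

    def _remove(self, key):
        for tier in self._tiers.values():
            tier.pop(key, None)
        self._pinned.discard(key)
        self._expires.pop(key, None)


cache = TieredCache({"RAM": 2, "SSD": 2, "HDD": 2})
cache.put("a", 1, pinned=True)  # pinned: stays in RAM
cache.put("b", 2)
cache.put("c", 3)               # RAM is full, so "b" is demoted to SSD
```

Reading "b" afterwards promotes it back to RAM and pushes "c" down to SSD, while the pinned block "a" stays where it is.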


Alluxio Benefits

Unification: New workflows across any data in any storage system

Performance: Orders of magnitude improvement in run time

Flexibility: Choice in compute and storage – grow each independently, buy only what is needed


100+ known production deployments

AND MORE!


Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.

[Diagram: Spark and Teradata]

Solution – ETL data from Teradata to Alluxio

Impact – Faster time to market: “Now we don’t have to work Sundays”

Use case: http://bit.ly/2oMx95W


Big Data Case Study – Top 3 Retailer

Challenge – Bottleneck in trend analysis of mission-critical daily sales and inventory management. Queries were slow / not interactive, resulting in operational inefficiency.

[Diagram: Spark and HDFS]

Solution – With Alluxio, data queries are 10X faster

Impact – Higher operational efficiency

Use case: http://bit.ly/2ook8Nh


Consumer Intelligence Use Case – Top 3 Telco

Challenge – Desired a central view of consumer information in near real time for proactive support. Many HDFS clusters, different distributions, many incompatible versions. On-prem & cloud. Integration through heavy ETL.

Solution – Alluxio integrates data into a central catalog for fast access to consumer interaction records.

Impact – Reduced integration time. Faster data speed & freshness.

[Diagram: ML and Hadoop workloads over multiple HDFS clusters (HDP, CDH, MapR), with ETL]


Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data. Queries were slow / not interactive, resulting in operational inefficiency.

[Diagram: Spark and Baidu File System]

Solution – With Alluxio, data queries are 30X faster

Impact – Higher operational efficiency

http://bit.ly/2pDHS3O


Big Data Case Study –

Challenge – Gain end-to-end view of business with data distributed across geographies. Data ETL was not possible due to regulatory concerns.

[Diagram: Spark and multiple HDFS clusters]

Solution – With Alluxio, data can be accessed without storing locally

Impact – Higher operational efficiency and solved regulatory concerns


Big Data Case Study –

Challenge – Gain end-to-end view of business with large volume of data for $5B travel site. Queries were slow / not interactive, resulting in operational inefficiency.

[Diagram: Spark and Flink over HDFS and Ceph]

Solution – With Alluxio, 300x improvement in performance

Impact – Increased revenue from immediate response to user behavior

Use case: http://bit.ly/2pDJdrq


Machine Learning Case Study –

Challenge – Disparate data both on-prem and in the cloud. Heterogeneous types of data. Scaling of exabyte-size data. Slow due to disk-based approach.

[Diagram: Spark, HDFS, Minio; additional labels: MES, OS]

Solution – Using Alluxio to prevent I/O bottlenecks

Impact – Orders of magnitude higher performance than before.

http://bit.ly/2p18ds3


Visualizing the Stack with Alluxio

Application

Alluxio MEM: FAST, 10^4 - 10^5 MB/s, accessed often

Alluxio SSD/HDD: MODERATE, 10^3 - 10^4 MB/s, limited access

Remote Storage: SLOW, 10^2 - 10^3 MB/s, accessed only when necessary


1.6.0/1 Release Highlights

Ecosystem Integrations

•  S3 client

•  Python client

Performance Improvement

•  Avoid unnecessary read when closing a file with partial caching on

•  Improve data distribution when using DeterministicHashPolicy

Usability Improvements

•  Audit logging

•  Third party UFS connector management

•  Dynamically adjusting log levels

•  Remote logging

And Many More! http://www.alluxio.org/releases

Social Media

Twitter.com/alluxio

Linkedin.com/alluxio

Website www.alluxio.com

E-mail [email protected]


Contact: [email protected]

Websites: www.alluxio.com and www.alluxio.org

Demos:

•  Spark + Alluxio + S3 https://youtu.be/QVtxDpA-jis

•  Alluxio Unified Namespace https://youtu.be/lIXpNK4VxqE