The Architecture of Decoupling Compute and Storage with Alluxio
Haoyuan Li, Alluxio Inc.
December 2017 @ Strata Singapore
Confidential © Alluxio, Inc. All Rights Reserved. 2
About Me
• Haoyuan (H.Y.) Li
• Founder and CEO at Alluxio, Inc.
• Created Alluxio (originally named Tachyon) at UC Berkeley AMPLab as a Ph.D. candidate
• Google, Cornell University, Peking University
Decoupling Compute From Storage
Benefits –
• Different compute and storage hardware requirements
• Scale compute and storage resources independently
• Traditional filers/SANs and cost-effective object stores (Amazon S3, Google GCS, Microsoft Azure Blob Store) are inherently decoupled
• Fast-evolving big data ecosystem
Challenges –
• Accessing data requires remote I/O
Remote I/O
Spark
Amazon S3
Every data operation requires data transfer, sometimes over the WAN
High latency, network throughput
Data Operations with Alluxio
Spark
Alluxio
Amazon S3
Low latency, memory throughput (Spark to Alluxio)
High latency, network throughput (Alluxio to Amazon S3)
Keeping data in Alluxio accelerates data access
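The idea on this slide can be illustrated with a minimal read-through cache sketch. It is not Alluxio's implementation: `RemoteStore` and `ReadThroughCache` are hypothetical names, the "remote store" is an in-memory dictionary standing in for something like Amazon S3, and the example only counts how many reads would have crossed the network.

```python
class RemoteStore:
    """Hypothetical stand-in for a remote object store such as Amazon S3."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0  # number of simulated remote transfers

    def read(self, key):
        self.reads += 1
        return self._objects[key]


class ReadThroughCache:
    """Keeps data fetched from the remote store in local memory."""

    def __init__(self, remote):
        self._remote = remote
        self._memory = {}

    def read(self, key):
        if key not in self._memory:
            # Cold read: one transfer from the remote store.
            self._memory[key] = self._remote.read(key)
        # Warm read: served from memory, no remote transfer.
        return self._memory[key]


remote = RemoteStore({"sales/2017.csv": b"id,amount\n1,10\n"})
cache = ReadThroughCache(remote)
first = cache.read("sales/2017.csv")   # goes to the remote store
second = cache.read("sales/2017.csv")  # served from memory
```

After the two reads, `remote.reads` is 1: only the first access paid the remote I/O cost.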
Data Ecosystem Develops
• One Compute Framework
• Single Storage System
• Co-located ETL
Data Ecosystem Explodes
• Many Compute Frameworks
• Many Storage Systems
• Most not co-located
Data Ecosystem Issues
• Each app manages multiple data sources
• Data source changes require global updates
• Storage optimizations require app changes
• Poor performance due to lack of locality
Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Highest performance in Memory
• No Lock in
Client-side interfaces: Native File System, Hadoop Compatible File System, Amazon S3 Interface, REST Web Service
Under Store interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, NFS Interface, …
History
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Rebranded to Alluxio in early 2016
• Open sourced in 2013
• Apache License 2.0
• Latest stable release: Alluxio 1.6.1 (Nov 2017)
Fastest Growing Big Data Open Source Project
Fastest-growing open-source project in the big data ecosystem
Running the world’s largest production clusters
600+ Contributors from 100+ organizations
[Figure: Open Source Contributors by Month (GitHub). Y-axis: Number of Contributors. Series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive.]
What’s Alluxio
Memory-Speed Virtual Distributed Storage
• Non-persistent data-storage software.
• Scale out architecture.
• Virtualized across different storage types under a unified namespace.
• Memory-speed access to data.
Alluxio Innovation: Unified Namespace
Enables effective data management across different Under Stores
Uses Mounting with Transparent Naming
Alluxio Innovation: Unified Namespace
Create a catalog of available data sources for Data Scientists
alluxio://
/finance/customer-transactions/
/finance/vendor-transactions/
/operations/device-logs/
/operations/phone-call-recordings/
/operations/check-images/
/research/us-economic-data/
/research/intl-economic-data/
/marketing/advertising-dataset/
/marketing/marketing-funnel-dataset/
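Mounting with transparent naming can be pictured as a table that maps namespace prefixes to under-store locations. The sketch below is an illustration only, not Alluxio's code: `MountTable` is a hypothetical class, the `hdfs://` and `s3a://` locations are made-up examples, and longest-prefix matching is an assumed resolution rule.

```python
class MountTable:
    """Hypothetical mount table: namespace prefix -> under-store URI."""

    def __init__(self):
        self._mounts = {}

    def mount(self, prefix, under_store_uri):
        self._mounts[prefix.rstrip("/")] = under_store_uri.rstrip("/")

    def resolve(self, path):
        # Assumed rule: the longest matching prefix wins, so nested
        # mounts resolve to the most specific under store.
        candidates = [
            p for p in self._mounts
            if path == p or path.startswith(p + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {path}")
        prefix = max(candidates, key=len)
        return self._mounts[prefix] + path[len(prefix):]


table = MountTable()
# Both under-store locations below are invented for illustration.
table.mount("/finance", "hdfs://finance-cluster/warehouse")
table.mount("/research", "s3a://research-bucket/datasets")

resolved = table.resolve("/research/us-economic-data/2017.csv")
```

An application only ever sees the `/research/...` path; swapping the under store behind a prefix means changing one mount entry rather than every application.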
Alluxio Innovation: Server-side API Translation
Convert from Client-side Interface to native Storage Interface
Client side: HDFS Interface
Storage side: HDFS Interface, S3A Interface, Swift Interface, Google Cloud Interface
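One way to picture server-side translation is an adapter layer: the client calls a single file-style interface, and the server forwards each call to whichever storage-specific adapter owns the path. Everything in this sketch is hypothetical (`UnderStore`, `FakeObjectStore`, `FakeFileStore`, `TranslatingServer`), and the two stores are in-memory fakes rather than real HDFS or S3 clients.

```python
from abc import ABC, abstractmethod


class UnderStore(ABC):
    """Hypothetical storage-side interface the server translates to."""

    @abstractmethod
    def get(self, relative_path: str) -> bytes:
        ...


class FakeObjectStore(UnderStore):
    """Object-store style: flat keys inside a bucket."""

    def __init__(self, blobs):
        self._blobs = blobs

    def get(self, relative_path):
        return self._blobs[relative_path.lstrip("/")]


class FakeFileStore(UnderStore):
    """File-system style: nested directories."""

    def __init__(self, tree):
        self._tree = tree

    def get(self, relative_path):
        node = self._tree
        for part in relative_path.strip("/").split("/"):
            node = node[part]
        return node


class TranslatingServer:
    """Exposes one client-side call and translates it per under store."""

    def __init__(self):
        self._stores = {}

    def attach(self, prefix, store):
        self._stores[prefix] = store

    def read_file(self, path):
        for prefix, store in self._stores.items():
            if path.startswith(prefix + "/"):
                return store.get(path[len(prefix):])
        raise FileNotFoundError(path)


server = TranslatingServer()
server.attach("/logs", FakeObjectStore({"2017/12/app.log": b"ok"}))
server.attach("/tables", FakeFileStore({"sales": {"part-0": b"rows"}}))

log_bytes = server.read_file("/logs/2017/12/app.log")
table_bytes = server.read_file("/tables/sales/part-0")
```

The client code is identical for both reads; only the server knows that one path is backed by flat object keys and the other by a directory tree.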
Alluxio Innovation: Intelligent Cache
Local performance from remote data using multi-tier storage
RAM (Hot), SSD (Warm), HDD (Cold)
Read & Write Buffering
Transparent to App
Policies for pinning, promotion/demotion, TTL
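A toy model may help make pinning and promotion/demotion concrete. This is a sketch under stated assumptions, not Alluxio's policy engine: `TieredCache` is a hypothetical class, capacities are counted in blocks, eviction within a tier is assumed to be least-recently-used, every read promotes the block to the top tier, and TTL handling is left out for brevity.

```python
from collections import OrderedDict


class TieredCache:
    """Toy multi-tier store; the first tier is the fastest (e.g. RAM)."""

    def __init__(self, capacities):
        # capacities: ordered mapping of tier name -> max number of blocks
        self._names = list(capacities)
        self._capacity = dict(capacities)
        self._tiers = {name: OrderedDict() for name in self._names}
        self._pinned = set()

    def pin(self, key):
        """Pinned blocks are never chosen for demotion."""
        self._pinned.add(key)

    def tier_of(self, key):
        for name in self._names:
            if key in self._tiers[name]:
                return name
        return None

    def put(self, key, value, level=0):
        if level >= len(self._names):
            return  # fell off the slowest tier: evicted from the cache
        name = self._names[level]
        tier = self._tiers[name]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        while len(tier) > self._capacity[name]:
            victim = next((k for k in tier if k not in self._pinned), None)
            if victim is None:
                break  # everything here is pinned; tolerate overflow
            # Demote the least recently used unpinned block one tier down.
            self.put(victim, tier.pop(victim), level + 1)

    def get(self, key):
        name = self.tier_of(key)
        if name is None:
            return None
        value = self._tiers[name].pop(key)
        self.put(key, value, 0)  # promote on access
        return value


cache = TieredCache({"RAM": 1, "SSD": 1, "HDD": 1})
for block in ("a", "b", "c"):
    cache.put(block, block.upper())
placement_before = {k: cache.tier_of(k) for k in "abc"}
cache.get("a")  # reading the cold block promotes it to RAM
placement_after = {k: cache.tier_of(k) for k in "abc"}
```

With one block per tier, the oldest block "a" sinks to HDD as newer blocks arrive, then moves back to RAM when it is read, pushing "c" and "b" down a tier each.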
Alluxio Benefits
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage – grow each independently, buy only what is needed
Big Data Case Study –
Challenge –
• Gain end to end view of business with large volume of data
• Queries were slow / not interactive, resulting in operational inefficiency
Diagram labels: SPARK, TERADATA
Solution – ETL data from Teradata to Alluxio
Impact – Faster Time to Market – “Now we don’t have to work Sundays”
Use case: http://bit.ly/2oMx95W
Big Data Case Study – Top 3 Retailer
Challenge –
• Bottleneck in trend analysis of mission-critical daily sales and inventory management
• Queries were slow / not interactive, resulting in operational inefficiency
Diagram labels: SPARK, HDFS
Solution – With Alluxio, data queries are 10X faster
Impact – Higher operational efficiency
Use case: http://bit.ly/2ook8Nh
Consumer Intelligence Use Case – Top 3 Telco
Challenge –
• Desired a central view of consumer information in near real time for proactive support
• Many HDFS, different distributions, many incompatible versions
• On-prem & cloud
• Integration through heavy ETL
Diagram labels: ML, HADOOP, ETL, HDFS (HDP, CDH, MAPR)
Solution – Alluxio integrates data into central catalog for fast access to consumer interaction records
Impact –
• Reduced integration time
• Faster data speed & freshness
Big Data Case Study –
Challenge –
• Gain end to end view of business with large volume of data
• Queries were slow / not interactive, resulting in operational inefficiency
Diagram labels: SPARK, Baidu File System
Solution – With Alluxio, data queries are 30X faster
Impact – Higher operational efficiency
Use case: http://bit.ly/2pDHS3O
Big Data Case Study –
Challenge –
• Gain end to end view of business with data distributed across geographies
• Data ETL was not possible due to regulatory concerns
Diagram labels: SPARK, HDFS
Solution – With Alluxio, data can be accessed without storing locally
Impact – Higher operational efficiency and solved regulatory concerns
Big Data Case Study –
Challenge –
• Gain end to end view of business with large volume of data for $5B Travel Site
• Queries were slow / not interactive, resulting in operational inefficiency
Diagram labels: SPARK, FLINK, HDFS, CEPH
Solution – With Alluxio, 300x improvement in performance
Impact – Increased revenue from immediate response to user behavior
Use case: http://bit.ly/2pDJdrq
Machine Learning Case Study –
Challenge –
• Disparate data both on-prem and cloud
• Heterogeneous types of data
• Scaling of exabyte-size data
• Slow due to disk-based approach
Diagram labels: SPARK, HDFS, MINIO, MESOS
Solution – Using Alluxio to prevent I/O bottlenecks
Impact – Orders of magnitude higher performance than before
Use case: http://bit.ly/2p18ds3
Visualizing the Stack with Alluxio
Application
• Alluxio MEM: FAST, 10^4 - 10^5 MB/s, accessed often
• Alluxio SSD/HDD: MODERATE, 10^3 - 10^4 MB/s, limited access
• Remote Storage: SLOW, 10^2 - 10^3 MB/s, accessed only when necessary
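To get a feel for what these ranges mean, the arithmetic below estimates how long a sequential scan would take at each layer. It assumes the slide's flattened figures are powers of ten (10^4 - 10^5, 10^3 - 10^4 and 10^2 - 10^3 MB/s) and that they belong to Alluxio memory, Alluxio SSD/HDD and remote storage respectively. The 1 TB dataset size and the `scan_seconds` helper are chosen purely for illustration.

```python
# Throughput ranges in MB/s, read from the slide as powers of ten (assumption).
TIERS_MB_PER_S = {
    "Alluxio MEM": (1e4, 1e5),
    "Alluxio SSD/HDD": (1e3, 1e4),
    "Remote Storage": (1e2, 1e3),
}


def scan_seconds(dataset_mb, throughput_range):
    """Return (best case, worst case) seconds to read dataset_mb once."""
    low, high = throughput_range
    return dataset_mb / high, dataset_mb / low


dataset_mb = 1_000_000  # 1 TB in decimal megabytes, an illustrative size
estimates = {
    tier: scan_seconds(dataset_mb, rng)
    for tier, rng in TIERS_MB_PER_S.items()
}

for tier, (best, worst) in estimates.items():
    print(f"{tier}: {best:.0f} to {worst:.0f} s")
```

Under these assumptions a 1 TB scan takes roughly 10 to 100 seconds from memory, versus 1,000 to 10,000 seconds from remote storage, which is why the stack tries to serve reads from the upper layers whenever possible.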
1.6.0/1 Release Highlights
Ecosystem Integrations
• S3 client
• Python client
Performance Improvement
• Avoid unnecessary read when closing a file with partial caching on
• Improve data distribution when using DeterministicHashPolicy
Usability Improvements
• Audit logging
• Third party UFS connector management
• Dynamically adjusting log levels
• Remote logging
And Many More! http://www.alluxio.org/releases
Social Media
Twitter.com/alluxio
Linkedin.com/alluxio
Website www.alluxio.com
E-mail [email protected]
Contact: [email protected]
Websites: www.alluxio.com and www.alluxio.org
Demo:
• Spark + Alluxio + S3 https://youtu.be/QVtxDpA-jis
• Alluxio Unified Namespace https://youtu.be/lIXpNK4VxqE