Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017

ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO

Adit Madan

March 2017

ABOUT ME

Adit Madan, Software Engineer @ Alluxio, Inc

Master’s @ Carnegie Mellon University

Bachelor’s @ Indian Institute of Technology, Delhi

Email: [email protected]

ALLUXIO INTRODUCTION

FASTEST-GROWING BIG DATA PROJECT

• Fastest growing open-source project in the big data ecosystem

• 400+ contributors from 100+ organizations

• Running the world’s largest production clusters

• Join the community!

BIG DATA ECOSYSTEM WITH ALLUXIO

[Diagram panels: Big Data Ecosystem Yesterday; Big Data Ecosystem Today; Big Data Ecosystem Issues; Big Data Ecosystem with Alluxio]

Application-side interfaces:

• FUSE Compatible File System

• Hadoop Compatible File System

• Native Key-Value Interface

• Native File System

Storage-side interfaces:

• HDFS Interface

• Amazon S3 Interface

• Swift Interface

• GlusterFS Interface

Enabling Applications to Access Data from any Storage System at Memory Speed

WHY ALLUXIO

Co-located with compute, provides memory-speed access to data

Virtualized across different storage systems under a unified global namespace

Distributed system, scale-out architecture

Software only, no changes needed to existing applications

ALLUXIO BENEFITS

Unification: New workflows across any data in any storage system

Performance: Orders of magnitude improvement in run time

Flexibility: Choice in compute and storage – grow each independently, buy only what is needed

USE CASE – ACCELERATE I/O TO/FROM REMOTE STORAGE

• Compute and Storage Separation
  • Advantages
    • Meet different compute and storage hardware requirements efficiently
    • Scale compute and storage independently
    • Store data in traditional filers/SANs and object stores cost effectively
    • Compute on data in existing storage via Big Data computational frameworks
  • Disadvantage
    • Accessing data requires remote I/O

USE CASE WITHOUT ALLUXIO

Spark

Storage

Low latency, memory throughput

High latency, network throughput

USE CASE WITH ALLUXIO

Spark

Storage

Alluxio

Keeping data in Alluxio accelerates data access

ACCELERATE I/O TO/FROM REMOTE STORAGE

“The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.”

- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster runs stably, providing over 50TB of RAM space

• By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
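As a quick check of the figures above, a speedup is the baseline run time divided by the accelerated run time. The following is a minimal sketch using only the numbers quoted on these slides; the function name `speedup` is illustrative and does not come from the deck.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Batch query of 15 minutes vs. interactive query of 30 seconds (slide figures).
batch_vs_interactive = speedup(15 * 60, 30)

# Spark SQL alone (100-150 s) vs. with Alluxio (10-15 s), pairing the same
# end of each quoted range.
quote_low_end = speedup(100, 10)
quote_high_end = speedup(150, 15)

print(batch_vs_interactive, quote_low_end, quote_high_end)
```

Since the slide says "over 15 minutes" and "less than 30 seconds", the first ratio is a lower bound that lines up with the 30x claim; the ranges in the Baidu quote work out to about 10x.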

Baidu’s PMs and analysts run interactive queries to gain insights into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

ALLUXIO

Baidu File System

ALLUXIO ON CEPH

Spark

Ceph Object Storage

Alluxio

● Connect using RADOS Gateway
  ○ Swift Object Storage API
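To make the connection above more concrete, here is a hedged sketch that renders the kind of `alluxio-site.properties` entries used to mount a Swift-API store as Alluxio's under storage. The property names and the `swiftauth` value are recalled from the Alluxio 1.x Swift documentation and should be verified against the docs for the version in use. The helper `swift_ufs_properties`, the host name, port, container, and credentials are all hypothetical placeholders, not values from the deck.

```python
def swift_ufs_properties(container: str, folder: str, user: str, tenant: str,
                         password: str, auth_url: str,
                         auth_method: str = "swiftauth") -> str:
    """Render properties lines for a Swift-API under store.

    ASSUMPTION: property names follow the Alluxio 1.x Swift documentation as
    recalled here; check them against the documentation before use.
    """
    props = {
        "alluxio.underfs.address": f"swift://{container}/{folder}",
        "fs.swift.user": user,
        "fs.swift.tenant": tenant,
        "fs.swift.password": password,
        "fs.swift.auth.url": auth_url,
        "fs.swift.auth.method": auth_method,
    }
    # Dicts keep insertion order, so the output order matches the order above.
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical values pointing at a RADOS Gateway Swift auth endpoint.
rendered = swift_ufs_properties(
    container="demo-container",
    folder="data",
    user="demo-user",
    tenant="demo-tenant",
    password="change-me",
    auth_url="http://rgw.example.com:7480/auth/1.0",
)
print(rendered)
```

Generating the file from a function keeps credentials out of version control and makes it easy to point the same cluster at a different gateway.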

EC2 CONFIGURATION

● 1 Compute Master
  ○ Spark and Alluxio Masters

● 3 Compute Workers
  ○ Spark and Alluxio Workers

● 1 Storage Manager
  ○ Ceph RadosGW and Monitor

● 2 Storage Devices
  ○ Ceph OSDs

● Instance type: r3.xlarge

● Availability Zone: us-east-1a

SOFTWARE VERSIONS

● Ceph Version: 0.94.9

● Alluxio Version: 1.4.0
  ○ Custom JOSS library 0.9.13-SNAPSHOT

● Spark Version: 1.6.1

DEMO OF THE SOLUTION

● Spark, Alluxio and Ceph Cluster pre-deployed

● Ceph pre-populated with a 60GB dataset

● Launch Spark shell
  a. First ‘count’
  b. Second ‘count’
  c. <Restart shell>
  d. Third ‘count’

● Ad-hoc queries w/ Alluxio
  a. ‘wordcount’ w/ intermediate data
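The demo steps above can be sketched in PySpark terms as follows. This is an illustration under assumptions, not the presenter's script: the deck does not show the commands used, the helper names `timed_count` and `word_count` are made up here, and the `alluxio://` URIs in the comments are hypothetical (19998 is the default Alluxio master port). The RDD calls (`textFile`, `count`, `flatMap`, `map`, `reduceByKey`, `saveAsTextFile`) are standard Spark API.

```python
import time


def timed_count(sc, uri):
    """Count the lines of a text dataset; return (count, elapsed seconds)."""
    start = time.perf_counter()
    n = sc.textFile(uri).count()
    return n, time.perf_counter() - start


def word_count(sc, input_uri, output_uri):
    """Classic word count that writes its result to output_uri."""
    counts = (sc.textFile(input_uri)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(output_uri)
    return counts


# In a pyspark shell, where `sc` is predefined, the demo sequence might be:
#   timed_count(sc, "alluxio://alluxio-master:19998/dataset")  # first count
#   timed_count(sc, "alluxio://alluxio-master:19998/dataset")  # second count
#   <restart the shell>
#   timed_count(sc, "alluxio://alluxio-master:19998/dataset")  # third count
#   word_count(sc, "alluxio://alluxio-master:19998/dataset",
#              "alluxio://alluxio-master:19998/wordcount-out")
```

The restart between the second and third count drops anything cached inside the Spark process, so a fast third count would indicate the data is being served from Alluxio rather than from Spark's own memory.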

SPARK COUNT PERFORMANCE

Count on 60 GB dataset

● 20x improvement for repeated access

FOR MORE INFORMATION ….

Please take a look at our whitepaper!

● Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

● Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

Thank you!

Contact: [email protected] or [email protected]: @Alluxio

Websites: www.alluxio.com and www.alluxio.org
