Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017

  • ENABLE FAST BIG DATA ANALYTICS ON CEPH WITH ALLUXIO

    Adit Madan

    March 2017

  • ABOUT ME

    Adit Madan, Software Engineer @ Alluxio, Inc

    Master's @ Carnegie Mellon University

    Bachelor's @ Indian Institute of Technology, Delhi

    Email: [email protected]

  • ALLUXIO INTRODUCTION

  • FASTEST-GROWING BIG DATA PROJECT

    Fastest growing open-source project in the big data ecosystem

    400+ contributors from 100+ organizations

    Running the world's largest production clusters

    Join the community!

  • BIG DATA ECOSYSTEM: YESTERDAY, TODAY, AND WITH ALLUXIO

    BIG DATA ECOSYSTEM ISSUES

    Enabling Applications to Access Data from any Storage System at Memory-speed

    FUSE Compatible File System

    Hadoop Compatible File System

    Native Key-Value Interface

    Native File System

    HDFS Interface

    Amazon S3 Interface

    Swift Interface

    GlusterFS Interface

  • WHY ALLUXIO

    Co-located with compute, provides memory-speed access to data

    Virtualized across different storage systems under a unified global namespace

    Distributed system, scale-out architecture

    Software only, no changes needed to existing applications

  • ALLUXIO BENEFITS

    Unification: New workflows across any data in any storage system

    Performance: Orders of magnitude improvement in run time

    Flexibility: Choice in compute and storage; grow each independently, buy only what is needed

  • USE CASE: ACCELERATE I/O TO/FROM REMOTE STORAGE

    Compute and Storage Separation

    Advantages

        Meet different compute and storage hardware requirements efficiently

        Scale compute and storage independently

        Store data in traditional filers/SANs and object stores cost effectively

        Compute on data in existing storage via Big Data computational frameworks

    Disadvantage

        Accessing data requires remote I/O

  • USE CASE WITHOUT ALLUXIO

    Spark

    Storage

    Low latency, memory throughput

    High latency, network throughput

  • USE CASE WITH ALLUXIO

    Spark

    Storage

    Alluxio

    Keeping data in Alluxio accelerates data access
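
The idea on this slide can be illustrated with a toy read-through cache. This is only a sketch of the general pattern, not Alluxio's implementation: `RemoteStore`, `MemoryCache`, the key `part-0`, and the 50 ms delay are all hypothetical stand-ins chosen for illustration.

```python
import time


class RemoteStore:
    """Stand-in for remote storage; every read pays a simulated network delay."""

    def __init__(self, blobs, delay_s=0.05):
        self.blobs = dict(blobs)
        self.delay_s = delay_s  # hypothetical per-read remote I/O cost
        self.reads = 0

    def read(self, key):
        self.reads += 1
        time.sleep(self.delay_s)
        return self.blobs[key]


class MemoryCache:
    """Read-through cache: the first read goes to the store, later reads hit memory."""

    def __init__(self, store):
        self.store = store
        self.data = {}

    def read(self, key):
        if key not in self.data:
            self.data[key] = self.store.read(key)
        return self.data[key]


store = RemoteStore({"part-0": b"x" * 1024})
cache = MemoryCache(store)

start = time.perf_counter()
cache.read("part-0")  # cold: fetched from the remote store
cold = time.perf_counter() - start

start = time.perf_counter()
cache.read("part-0")  # warm: served from local memory
warm = time.perf_counter() - start

print(f"cold read: {cold:.4f}s, warm read: {warm:.6f}s, remote reads: {store.reads}")
```

Only the first read touches the remote store; the repeated read is answered from memory, which is the access pattern the diagram describes.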

  • ACCELERATE I/O TO/FROM REMOTE STORAGE

    "The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds." - Baidu

    RESULTS

    Data queries are now 30x faster with Alluxio

    Alluxio cluster runs stably, providing over 50TB of RAM space

    By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds

    Baidu's PMs and analysts run interactive queries to gain insights into their products and business

    200+ nodes deployment

    2+ petabytes of storage

    Mix of memory + HDD

    ALLUXIO

    Baidu File System
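
As a quick arithmetic check on the figures quoted on this slide, the ratios can be worked out directly. The pairing of range endpoints below is an assumption made for illustration; the slide does not say which "before" time corresponds to which "after" time.

```python
def speedup(before_s, after_s):
    """Ratio of run time before to run time after."""
    return before_s / after_s


# Spark SQL alone (100-150 s) vs. with Alluxio (10-15 s), as quoted.
low = speedup(100, 15)   # most conservative pairing of the two ranges
high = speedup(150, 10)  # most favorable pairing of the two ranges
print(f"ad-hoc query speedup: {low:.1f}x to {high:.1f}x")
# prints "ad-hoc query speedup: 6.7x to 15.0x"

# Batch queries "over 15 minutes" vs. interactive queries "less than 30 seconds".
batch = speedup(15 * 60, 30)
print(f"batch-to-interactive speedup: at least {batch:.0f}x")
# prints "batch-to-interactive speedup: at least 30x"
```

The 15-minute and 30-second figures give a ratio of 30, which appears consistent with the "30x faster" result, while the quoted Spark SQL ranges correspond to roughly one order of magnitude.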

  • ALLUXIO ON CEPH

  • ALLUXIO ON CEPH

    Spark

    Ceph Object Storage

    Alluxio

    Connect using RADOS Gateway Swift Object Storage API
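
A minimal sketch of what the Swift connection settings might look like, written as a small helper that renders lines for `alluxio-site.properties`. The property names are taken from memory of the Alluxio 1.x Swift under-storage documentation and should be verified against the version in use; the helper `swift_ufs_properties`, the container, user, tenant, password, and gateway URL are all hypothetical placeholders, not values from the talk.

```python
def swift_ufs_properties(container, user, tenant, password, auth_url,
                         auth_method="swiftauth"):
    """Render alluxio-site.properties lines for a Swift-compatible under store.

    Property names are assumed from the Alluxio 1.x Swift documentation;
    check them against your Alluxio version before use.
    """
    props = [
        ("alluxio.underfs.address", f"swift://{container}"),
        ("fs.swift.user", user),
        ("fs.swift.tenant", tenant),
        ("fs.swift.password", password),
        ("fs.swift.auth.url", auth_url),
        ("fs.swift.auth.method", auth_method),
    ]
    return "\n".join(f"{key}={value}" for key, value in props)


# All values below are placeholders for illustration only.
text = swift_ufs_properties(
    container="analytics",
    user="demo-user",
    tenant="demo-tenant",
    password="<secret>",
    auth_url="http://rgw.example.com/auth/1.0",
)
print(text)
```

Generating the file from a function keeps the credentials out of the source and makes it easy to point the same cluster at a different RADOS Gateway endpoint.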

  • EC2 CONFIGURATION

    1 Compute Master: Spark and Alluxio Masters

    3 Compute Workers: Spark and Alluxio Workers

    1 Storage Manager: Ceph RadosGW and Monitor

    2 Storage Devices: Ceph OSDs

    Instance type: r3.xlarge

    Availability Zone: us-east-1a

  • SOFTWARE VERSIONS

    Ceph Version: 0.94.9

    Alluxio Version: 1.4.0

    Custom JOSS library: 0.9.13-SNAPSHOT

    Spark Version: 1.6.1

  • DEMO OF THE SOLUTION

    Spark, Alluxio and Ceph Cluster pre-deployed

    Ceph pre-populated with a 60GB dataset

    Launch spark shell

        a. First count

        b. Second count

        c.

        d. Third count

    Ad-hoc queries w/ Alluxio

        a. wordcount w/ intermediate data
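
The repeated-count steps of the demo amount to timing the same action several times and comparing later runs against the first. A small harness for that measurement could look like the sketch below. It does not call Spark: `fake_count` is a hypothetical stub standing in for a Spark `count()`, and its 50 ms first-run delay and result value are invented for illustration.

```python
import time


def time_repeated(action, runs=3):
    """Run `action` several times; return a (result, seconds) pair per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        result = action()
        timings.append((result, time.perf_counter() - start))
    return timings


def report(timings):
    """Format each run with its speedup relative to the first run."""
    first = timings[0][1]
    lines = []
    for i, (result, secs) in enumerate(timings, start=1):
        ratio = first / max(secs, 1e-9)  # guard against a zero-length interval
        lines.append(f"count #{i}: result={result} time={secs:.4f}s "
                     f"speedup_vs_first={ratio:.1f}x")
    return lines


_calls = {"n": 0}


def fake_count():
    """Hypothetical stub: slow on the first call, fast afterwards."""
    _calls["n"] += 1
    if _calls["n"] == 1:
        time.sleep(0.05)  # invented cold-read cost
    return 1_000_000      # invented record count


for line in report(time_repeated(fake_count)):
    print(line)
```

In a real session, `action` would be replaced by the actual count on the dataset, and the first run would include the cost of reading from Ceph.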

  • SPARK COUNT PERFORMANCE

    Count on 60 GB dataset

    20x improvement for repeated access

  • FOR MORE INFORMATION

    Please take a look at our whitepaper!

    Blog: https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

    Whitepaper: https://alluxio.com/resources/accelerating-data-analytics-on-ceph-object-storage-with-alluxio

  • Thank you!

    Contact: [email protected] or [email protected]

    Twitter: @Alluxio

    Websites: www.alluxio.com and www.alluxio.org