27
Serverless Data Analytics with Flint YOUNGBIN KIM AND JIMMY LIN

Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Serverless Data Analytics with Flint

YOUNGBIN KIM AND JIMMY LIN

Page 2: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Large-scale analytical data processing

o Spark adoption is booming

o Many use cases

Page 3: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Large-scale analytical data processing

o Spark adoption is boomingo Many use cases=> Requirement: Pre-installed on a cluster before they can be used for analytics

◦ On-premise data center or cluster of virtual instances in the cloud

Page 4: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Large-scale analytical data processingo Problem: Cluster management can be difficulto monitoring the health of worker nodeso troubleshoot a variety of issueso fixing/replacing underperforming nodes

May not be feasible for many small startups/researchers with the limited resources !

Page 5: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Large-scale analytical data processingo How about scaling?

Page 6: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Managed big data frameworkso Current solution: Managed big data frameworks

o Example: Amazon Elastic Map Reduce (EMR)

o Advantages:

o Reduces the burden of cluster management

o Save costs (automatically terminated)

o Limitations:

o Time is wasted in cluster initialization/rescaling/teardown

o Need to choose the details of the managed cluster

o There are still management overheads & idle costs.

Page 7: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Serverless analyticso Serverless analytics to the rescue!

Worker NodeExecutor Cache

Task

Cluster Manager

Task

Worker NodeExecutor Cache

Task Task

Page 8: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Flint

o Flint: prototype execution engine for serverless PySparko PySpark with serverless backend by simply specifying a config fileo No costs for idle capacityo Simplicityo Use cases: ad hoc analytics and exploratory data analysis

Page 9: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Flint architectureo Spark tasks are executed in AWS Lambdao Intermediate data are held in Amazon’s Simple Queue Service (SQS)o Reuses as many existing Spark components as possibleoQuery planning and optimizationoMany different types of RDD transformations

Page 10: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Workflow

Page 11: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

QueueQueue

Spark Context

Client

FlintScheduler Backend

SQS

FlintExecutor

FlintExecutor

Output Partition

Output PartitionFinal Stage

Intermediate Stage

FlintExecutor LambdaFlint

ExecutorFlint

Executor

Input Partition S3

Input Partition

Input Partition

Amazon Web Services

Data MovementControl Flow

Flint architecture

Page 12: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Flint architectureo The Flint scheduler coordinates Flint executors to execute a

particular physical planoFunction registrationoQueue initializationoSerializationo Invocation using thread pooloProcess the response from an executor

Page 13: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Flint executoro Flint executor is a python process running inside an Amazon Lambda

functionoEach serverless compute function invocation processes a single taskoSimplifies the communication requirement between an executor

and a driveroLess affected by the limitation of execution time

Page 14: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Remote storage for shuffling◦ No permanent storage◦ Small ephemeral disk space (~512 MB)◦ Execution time limitation

=> Cannot guarantee the Flint executors from the previous stage are still alive to pass data

◦ Communication between Lambda functions

◦ Amazon’s Simple Queue Service (SQS)◦ highly-scalable◦ reliable

Page 15: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation◦ A Spark cluster running the Databricks Unified Analytics Platform

(Standard)◦ 11 m4.2xlarge instances (one driver and ten workers) - 80 vCores◦ 80 max concurrent invocations (~ 80 vCores)

Page 16: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation◦ NYC taxi dataset (215 GB)◦ Pick-up and drop-off dates/time, trip distance, payment type, tip

amount◦ Queries inspired by an exploratory data analysis task described in a

popular blog post by Todd Schneider

Page 17: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation◦ Q0: Line count◦ Q1: Taxi drop-offs at the Goldman Sachs headquarters (hourly

aggregation)

Page 18: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation

◦ Q2: Similar to Q1, but for Citigroup headquarters◦ Q3: Goldman Sachs taxi drop-offs with tips greater than $10◦ Q4: Cash vs. credit card payments◦ Q5: Yellow taxi vs. green taxi , monthly aggregation◦ Q6: Effect of precipitation on taxi trips

Page 19: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation

Page 20: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Experimental Evaluation• Q1: Taxi drop-offs at the Goldman Sachs headquarters (hourly aggregation)• Q3: Goldman Sachs taxi drop-offs with tips greater than $10tradeoff of concurrency between the latency and the cost

Page 21: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Lambda Limitations◦ Most serverless platforms currently have several limitations

◦ Memory size (e.g. 3008 MB for AWS)◦ Execution time limitation (5 ~ 9 minutes)◦ Cold start problem

Page 22: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Record Record

Record

Record

Record

Record

Record

Record

Record

Record

Record

Record

Timeout

Input

Processed by another executor

Driver

{ bucket: ..., range: 0-1000, …}

{ status: incomplete, range: 0-719, …}

Record

Record

Record

Record{ bucket: ..., range: 720-1000, …}

{ status: complete, ...}

Lambda LimitationsExecution time limitation

Page 23: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Lambda Limitations

Page 24: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Lambda LimitationsOther constraints◦ Memory (3008 MB)◦ Request size: 6 MB

◦ Metadata◦ Response size: 6 MB

◦ Collect

Page 25: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Related Worko Iris:o The origin of Flint (A course project, UWaterloo, Fall 2016)o Distributed computation framework supporting a subset of Spark API

o In-browser data analytics backed by serverless backend

o Amazon Athenao Per-query pricing with zero idle costso Only supports SQL

o Presto distributed SQL engine

o Databricks Serverlesso Automatically managed pools of cloud resources

◦ auto-configured & auto-scaled

Page 26: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Related Worko PyWren

◦ Framework built from scratch on top of serverless compute functions and persistent storage

o Qubole Spark on Serverlesso Ported the existing Spark executor infrastructure onto AWS

Lambda, whereas Flint is a from-scratch implementationoCommunication modeloAWS Lambda limitations

Page 27: Serverless Data Analytics with Flint · Flint architecture o Spark tasks are executed in AWS Lambda o Intermediate data are held in Amazon’s Simple Queue Service (SQS) ... o Only

Future Work◦ Intensive shuffling tasks◦ Robustness◦ Higher level libraries (e.g. MLlib)