Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013

DESCRIPTION

As troves of data grow exponentially, the number of analytical jobs that process that data grows rapidly as well. When large teams run hundreds of analytical jobs, coordinating and scheduling those jobs becomes crucial. Using Amazon Simple Workflow Service (Amazon SWF) and AWS Data Pipeline, you can create automated, repeatable, schedulable processes that reduce or even eliminate custom scripting and help you run your Amazon Elastic MapReduce (Amazon EMR) or Amazon Redshift clusters efficiently. In this session, we show how you can automate your big data workflows. Learn best practices from customers like Change.org, Kickstarter, and UNSILO on how they use AWS to gain business insights from their data in a repeatable and reliable fashion.

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

SVC201 - Automate Your Big Data Workflows

Jinesh Varia, Technology Evangelist

@jinman

November 14, 2013

[Diagram: Automating Big Data Workflows. Amazon SWF (Decider, Workers) automates compute; AWS Data Pipeline (Activities, Data Nodes, Workers) automates data.]

[Diagram: Amazon SWF architecture: workflow starters, a decider, and activity workers coordinating through Amazon SWF, with workflow history visible in the AWS Management Console.]

Amazon SWF – Your Distributed State Machine in the Cloud

SWF helps you scale your business logic

Tim James – Data/science Architect, Manager
Vijay Ramesh – Data/science Engineer

Change.org – the world's largest petition platform

At Change.org in the last year

• 120M+ signatures — 15% on victories
• 4,000 declared victories

This works.

How?

60-90% of signatures at Change.org are driven by email.

This works... up to a point!

Manual Targeting doesn't scale:

• cognitively
• in personnel
• into mass customization
• culturally or internationally
• with data size and load

So what did we do?

We used big-compute machine learning to automatically target our mass emails across each week’s set of campaigns.

We started from here...

And finished here...

First: Incrementally extract (and verify) MySQL data to Amazon S3

Best Practice:

Incrementally extract with high watermarking.

(not wall-clock intervals)
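A minimal sketch of what high-watermark extraction can look like, assuming a MySQL client such as pymysql (the table and column names here are hypothetical, not Change.org's actual schema):

import pymysql  # assumption: any MySQL client works; pymysql is used for illustration

def extract_increment(conn, last_watermark):
    # Pull only rows newer than the last row successfully extracted, rather than
    # "everything in the last N minutes" (wall-clock windows drift and can drop
    # late-arriving rows).
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, user_id, created_at FROM signatures "  # hypothetical table
            "WHERE created_at > %s ORDER BY created_at",
            (last_watermark,),
        )
        rows = cur.fetchall()
    # Advance the watermark only to what was actually extracted, so a partial or
    # failed run never silently skips data on the next pass.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark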

Best Practice:

Verify data continuity after extract.

We used Cascading/Amazon EMR + Amazon SNS.
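A hedged sketch of that continuity check; the real one ran as a Cascading job on EMR, but the core idea is a count comparison plus an Amazon SNS alert (the topic ARN and counts are illustrative):

import boto3

def verify_extract(source_count, extracted_count, topic_arn):
    # Compare the row count the source database reports for the extracted window
    # with the number of rows that actually landed on S3; alert a human via SNS
    # if they disagree.
    if extracted_count != source_count:
        boto3.client("sns").publish(
            TopicArn=topic_arn,
            Subject="Extract continuity check failed",
            Message=f"expected {source_count} rows, extracted {extracted_count}",
        )
        return False
    return True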

Transform extracted data on S3 into a "Feature Matrix" using Cascading/Hadoop on Amazon Elastic MapReduce (a 100-instance EMR cluster).

A Feature Matrix is just a text file.

Sparse vector file line format, one line per user.

<user_id>[ <feature_id>:<feature_value>]...

Example:

123 12:0.237 18:1 101:0.578
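For concreteness, a few lines of Python that read one line of that sparse-vector format (this parser is illustrative, not the production Cascading code):

def parse_feature_line(line):
    # "123 12:0.237 18:1 101:0.578" -> (123, {12: 0.237, 18: 1.0, 101: 0.578})
    user_id, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return int(user_id), features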

So how do we do big-compute Machine Learning?

Enter Amazon Simple Workflow Service (SWF) and Amazon Elastic Compute Cloud (EC2).

SWF and EC2 allowed us to decouple:

• Control (and error) flow
• Task business logic
• Compute resource provisioning

SWF provides a distributed application model

Decider processes make discrete workflow decisions

Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types)
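The workers in this talk were written in Ruby; as a rough sketch in Python with boto3, an activity worker is just a poll-work-respond loop against its task list. The domain, task list, and train() step below are hypothetical names:

import boto3

swf = boto3.client("swf", region_name="us-east-1")

def train(payload):
    return "ok"  # placeholder for the actual ML step

def run_worker():
    while True:
        # Long-poll the task list this worker (and its instance type) is affined to.
        task = swf.poll_for_activity_task(
            domain="ml-targeting",                 # hypothetical domain
            taskList={"name": "train-task-list"},  # hypothetical task list
            identity="worker-1",
        )
        token = task.get("taskToken")
        if not token:
            continue  # the long poll timed out with no work
        try:
            result = train(task["input"])
            swf.respond_activity_task_completed(taskToken=token, result=result)
        except Exception as exc:
            swf.respond_activity_task_failed(taskToken=token, reason=str(exc)[:250])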

SWF provides a distributed application model

Allows deciders and workers to be implemented in any language.  

We used Ruby, with ML calculations done in Python, R, or C.

SWF provides a distributed application model

Rich web interface via the AWS Management Console.

Flexible API for control and monitoring.

SWF provides a distributed application model

Resource Provisioning with EC2

Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.

Simplifying Assumption:

The full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster).

Best Practice:

Treat compute resources as hotel rooms, not mansions.

Worker EC2 Instance bootstrap from base Amazon Machine Image (AMI)

EC2 instance tags provide highly visible, searchable configuration.

Update local git repo to configured software version.

Best Practice:

Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files.
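One way that bootstrap logging can look, assuming boto3 and the EC2 instance metadata endpoint (the bucket name and tag keys are hypothetical):

import urllib.request

import boto3

def log_bootstrap(log_path, bucket="bootstrap-logs"):
    # Who am I? The instance ID comes from the local metadata service.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

    # Read this instance's tags so the essential configuration stays searchable.
    tags = boto3.client("ec2").describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    config = {t["Key"]: t["Value"] for t in tags}

    # Ship the bootstrap log to S3 under a key that maps tags -> instance -> log file.
    key = "{}/{}/bootstrap.log".format(config.get("Name", "unnamed"), instance_id)
    boto3.client("s3").upload_file(log_path, bucket, key)
    return key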

Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.

Provisioning in R&D for Training

• Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm, repeatedly brute-force searching a 1,000-combination parameter space

• Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python

Provisioning in Production

• Train with a single SWF worker using multiple cores (Python multithreaded Random Forest)

• Predict with 8 SWF workers — 1 per core, 4 cores per instance

Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group
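Launching those per-campaign workers might look roughly like this with boto3 (the AMI ID, tag key, and counts are placeholders, not the production values):

import boto3

def start_campaign_workers(n, campaign_id, ami_id="ami-12345678"):
    ec2 = boto3.client("ec2")
    # n on-demand m3.2xlarge instances dedicated to one campaign's predictions.
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType="m3.2xlarge",
        MinCount=n,
        MaxCount=n,
    )
    instance_ids = [i["InstanceId"] for i in resp["Instances"]]
    # Tag them so the campaign shows up in the console and in the bootstrap logs.
    ec2.create_tags(
        Resources=instance_ids,
        Tags=[{"Key": "campaign_id", "Value": str(campaign_id)}],
    )
    return instance_ids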

Provisioning in Production

Best Practice:

Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.

Forward scale

So from here, how can we expect this system to scale?

Forward scale

• Run more EMR instances to build Feature Matrix

• Run more SWF predict workers per campaign

for 10x users

Forward scale

• We already start an SWF worker group per campaign automatically

• “User-generated campaigns” require no campaigner time and are targeted automatically

for 10x campaigns

Forward scale

• The system eliminates mass email targeting contention, so the team can scale

for 2x+ campaigners

Win for our Campaigners... and Users.

Our user base can now be automatically segmented across a wide pool of campaigns, even internationally.

30%+ conversion boost over manual targeting.

Do you build systems like these? Do you want to?

We'd love to talk. (And yes, we're hiring.)

UNSILO
Dr. Francisco Roque, Co-Founder and CTO

A collaborative search platform that helps you see patterns across Science & Innovation

Mission

UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific terminologies.

Describe, Discover, Analyze & Share

UNSILO: a new way of searching

Big Data Challenges

4.5 million USPTO granted patents

12 million scientific articles

Heterogeneous processing pipeline

(multiple steps, variable times)

A small test

1,000 documents, 20 minutes/doc average

A bigger test

100k documents: 3.8 years?

A bigger test

100k documents, 8x8 cores: ~21 days
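A rough check of those estimates (assuming one 20-minute pass per document and perfect parallel efficiency):

100{,}000 \times 20\ \text{min} = 2{,}000{,}000\ \text{min} \approx 3.8\ \text{years (serial)}
\qquad
\frac{2{,}000{,}000\ \text{min}}{8 \times 8\ \text{cores}} \approx 31{,}250\ \text{min} \approx 21.7\ \text{days}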

4.5 million patents? 12 million articles?

Focus on the goal

Amazon SWF to the rescue

• Scaling
• Concurrency
• Reliability
• Flexibility to experiment
• Easily adaptable

SWF makes it very easy to separate algorithmic logic and workflow logic

Easy to get started: First document batch running in just 2 weeks

AWS services

Adding content

Job Loading

• Content loaded by traversing S3 buckets

• Reprocessing by traversing tables on DynamoDB

Decision Workers

• Crawls workflow history for decision tasks
• Schedules new activity tasks

Activity Workers

• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed between workflow steps
• Specialized workers

Best practice

Use DynamoDB for content status

Index on different columns (local indexes)

More efficient content status queries: "Give me all the items that completed step X"

Elastic service!
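As a sketch of that query pattern with boto3 (the table name, index name, and attribute names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("content_status")  # hypothetical table

# "Give me all the items that completed step X" becomes a Query against a local
# secondary index keyed on the step-status attribute, instead of a full table Scan.
resp = table.query(
    IndexName="step_x_status-index",  # hypothetical LSI name
    KeyConditionExpression=(
        Key("batch_id").eq("2013-11-14")        # table hash key
        & Key("step_x_status").eq("completed")  # LSI range key
    ),
)
completed_items = resp["Items"]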

Key to scalability

File organization on S3 for scalability:
– 50 req/s with the naïve approach
– >1,500 req/s with reversed key prefixes

Naïve, sequential keys:
logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...

Reversed keys (prefix entropy spreads load across partitions):
43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...

http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
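The reversal itself is trivial; a Python sketch of the key-naming trick above (the "/logs/..." suffix mirrors the slide, and the bucket layout is hypothetical):

def distributed_key(timestamp):
    # Reversing the timestamp puts the high-entropy characters first, so sequential
    # writes spread across S3 partitions instead of piling onto one hot prefix.
    return timestamp[::-1] + "/logs/..."

print(distributed_key("2013-11-14T23:01:34"))  # -> 43:10:32T41-11-3102/logs/...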

Gearing ratio?

Monitoring

Give me all the workers/instances that have not responded in the past hour

Amazon SWF components

DynamoDB

Throttling and eventual consistency

Failed? Try again.

Development environment

Huge benefits

100k documents: from 21 days to < 1 hour

4.5 million USPTO patents: ~30 hours

Huge benefits

Focus on our goal, faster time to market

Using Spot Instances: 1/10 the cost

Key SWF Takeaways

Flexibility – room for experimentation

Transparency – easy to adapt

Growing with the system – not constrained by the framework

[Diagram: Amazon SWF coordinating a decider and workers.]

UNSILO
www.unsilo.com

Sign up to be invited for the Public Beta

[Transition diagram: Automating Big Data Workflows. Having covered Amazon SWF (Decider, Workers) for automating compute, the session moves to AWS Data Pipeline (Activities, Data Nodes, Workers) for automating data.]

AWS Data Pipeline – Your ETL in the Cloud

[Diagram: AWS Data Pipeline moving data between data stores through compute resources.]

AWS Data Pipeline Patterns (Activity Workers)

[Diagram: example intra-region, inter-region, and cloud-to-on-premises ETL patterns combining S3, EMR, DynamoDB, RDS, EC2, Hive/Pig, and Redshift.]

Fred Benenson, Data Engineer, Kickstarter

A new way to fund creative projects:

All-or-nothing fundraising.

5.1 million people have backed a project

51,000+ successful projects

44% of projects hit their goal

$872 million pledged

78% of projects raise under $10,000

51 projects raised more than $1 million

Project case study: Oculus Rift

Data @ Kickstarter

• We have many different data sources

• Some relational data, like MySQL on Amazon RDS

• Other unstructured data, like JSON stored in a third-party service like Mixpanel

• What if we want to JOIN between them in Amazon Redshift?

Case study: Find the users that have Page View A but not User Action B

• Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:

{ “Page View A”, { user_uid : 1231567, ... } }

• But User Action B is just the existence of a timestamp in a MySQL row:

6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL

Redshift to the Rescue!

SELECT
  users.id,
  COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                      THEN user_actions.id
                      ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
  AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id

How do we automate the data flow to keep it fresh daily?

But how do we get the data to Redshift?

This is where AWS Data Pipeline comes in.

Pipeline 1: RDS to Redshift - Step 1

First, we run sqoop on Elastic MapReduce to extract MySQL tables into CSVs.

Pipeline 1: RDS to Redshift - Step 2

Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.

• 150-200 gigabytes
• New DB every day, drop old tables
• Using AWS Data Pipeline’s 1-day ‘now’ schedule
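The NULL-scrubbing streaming step above can be a mapper as small as this sketch (the NULL sentinel and the comma delimiter are assumptions about how the tables were exported):

import sys

# Hadoop Streaming mapper: read the exported CSV on stdin, rewrite the NULL
# sentinel as an empty string so Redshift's COPY accepts the column, and write
# the cleaned line to stdout. No reducer is needed.
NULL_SENTINEL = "null"  # assumption: whatever string Sqoop wrote for NULL values

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    cleaned = ["" if field == NULL_SENTINEL else field for field in fields]
    print(",".join(cleaned))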

Pipeline 1: RDS to Redshift - Transfer to S3

Pipeline 1: RDS to Redshift Again

Run a similar pipeline job in parallel for our other database.

Pipeline 2: Mixpanel to Redshift - Step 1

Spin up an EC2 instance to download the day’s data from Mixpanel.

Use Elastic MapReduce to transform Mixpanel’s unstructured JSON into CSVs.
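A toy version of that transform, assuming one JSON event per line with an event name and a properties object (the real export format and the columns Kickstarter kept may differ):

import csv
import json
import sys

# Flatten raw JSON events into fixed columns that Redshift can COPY from CSV.
writer = csv.writer(sys.stdout)
for line in sys.stdin:
    event = json.loads(line)
    props = event.get("properties", {})
    writer.writerow([
        event.get("event", ""),
        props.get("distinct_id", ""),  # assumed property names
        props.get("time", ""),
    ])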

Pipeline 2: Mixpanel to Redshift - Step 2

• 9-10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days

Pipeline 2: Mixpanel to Redshift - Transfer to S3

AWS Data Pipeline Best Practices

• JSON / CLI tools are crucial
• Build scripts to generate JSON
• ShellCommandActivity is powerful
• Really invest time to understand scheduling
• Use S3 as the “transport” layer
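In that spirit, a small script can generate the pipeline-definition JSON (including a ShellCommandActivity) instead of hand-editing it. This is roughly the shape of a definition file, with placeholder names and a placeholder command, not Kickstarter's actual pipeline:

import json

def pipeline_definition(command):
    # Minimal object graph: a daily schedule, an EC2 resource to run on, and a
    # ShellCommandActivity that executes the given command.
    objects = [
        {"id": "Default", "name": "Default",
         "scheduleType": "cron", "schedule": {"ref": "DailySchedule"}},
        {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        {"id": "WorkerInstance", "name": "WorkerInstance", "type": "Ec2Resource"},
        {"id": "ExtractActivity", "name": "ExtractActivity",
         "type": "ShellCommandActivity",
         "command": command,
         "runsOn": {"ref": "WorkerInstance"}},
    ]
    return json.dumps({"objects": objects}, indent=2)

print(pipeline_definition("echo extract-and-load"))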

AWS Data Pipeline Takeaways for Kickstarter

15 years ago: $1 million or more

5 years ago: Open source + staff & infrastructure

Now: ~$80 a month on AWS

“It just works”

[Recap diagram: Automating Big Data Workflows. Amazon SWF (Decider, Workers) automates compute; AWS Data Pipeline (Activities, Data Nodes, Workers) automates data.]

Big Thank You to Customer Speakers!

Jinesh Varia

@jinman

More Sessions on SWF and AWS Data Pipeline

SVC101 - 7 Use Cases in 7 Minutes Each: The Power of Workflows and Automation (next up in this room)

BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

SVC201
