© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jason Woodlee, Datapipe Sr. Director Cloud Products

Ilya Krammer, Datapipe Software Engineer

October 2015

(ISM213) Building and Deploying a Modern Big Data Architecture on AWS



What to Expect from the Session

Presentation Overview

• Cloud analytics

• Refactoring to Amazon EMR

• Technical learnings

• Testing in a cloud Hadoop world

• Organization learnings

• Future of analytics

Cloud Analytics

Acquired in 2013

Data-mining and intelligence tool to govern and analyze your AWS environment

Architecture Time Capsule

• Multiple disparate apps

• Heavy utilization of agents and APIs

• Multiple queues and high-frequency polling

• Heavy concurrent real-time MongoDB access

Architecture

The Big Data Problem

• Multiple consumers: dashboard, aggregated reports, query interfaces

• 2 GB files for each of our 1500+ clients on an hourly basis

• 1000s of large API payloads per second

• 40+ VMs with distinct ETL processes

• Massive Mongo instances
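To put the file volume above in perspective, here is a back-of-the-envelope calculation. It assumes exactly one 2 GB file per client per hour and exactly 1500 clients, which is one reading of the slide rather than a figure the speakers stated; the API payload traffic is not included.

```python
# Rough ingest volume, ASSUMING one 2 GB file per client per hour.
# The slide says "1500+" clients, so this is a lower bound under that reading.
FILE_SIZE_GB = 2
CLIENTS = 1500
HOURS_PER_DAY = 24

hourly_gb = FILE_SIZE_GB * CLIENTS            # gigabytes arriving each hour
daily_tb = hourly_gb * HOURS_PER_DAY / 1000   # decimal terabytes per day

print(f"{hourly_gb} GB/hour, {daily_tb} TB/day")
```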

The Big Data Problem, continued

• Processing was slow and error-prone

• Infrastructure was mostly static with single points of failure; maintenance was intense

• Release management became problematic

• Data store became a bottleneck to ETL and aggregation

• Always-on MongoDB infrastructure became expensive

• Spend misaligned with client usage

“Why are we paying so much for what is essentially data at rest?”

Eureka Moment

Redesign Goals

Analytics

• Improve Performance

• Increase Scale

• Reduce Cost

Data Layer

• Increase Performance

• Reduce Cost

Reduce support footprint

Designing with EMR

• Separate raw data and user-visible data

• EMR with Amazon S3 instead of MongoDB

• AWS services

• On-demand infrastructure

• Store user-visible data (low latency) on SSD drives with TTLs for easy cleanup
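The on-demand design above can be sketched as a request for a transient EMR cluster that reads from and writes to Amazon S3 and shuts itself down when its steps finish. This is a minimal sketch, not the speakers' configuration: the cluster name, instance types and counts, release label, S3 URIs, and the streaming step are all hypothetical placeholders. The dictionary keys follow the parameters of boto3's `run_job_flow` call.

```python
def transient_cluster_request(input_uri, output_uri, mapper_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (name, instance types, counts, release label)
    are illustrative assumptions, not values from the presentation.
    """
    return {
        "Name": "analytics-hourly",          # hypothetical cluster name
        "ReleaseLabel": "emr-4.1.0",         # placeholder release
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": 10},
            ],
            # Transient cluster: terminate once all steps are done,
            # so nothing is billed while the data sits at rest in S3.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "aggregate",             # hypothetical step
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hadoop-streaming",
                         "-files", mapper_uri,
                         "-mapper", "mapper.py",
                         "-reducer", "aggregate",
                         "-input", input_uri,
                         "-output", output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "s3://example-raw/2015/10/01/00/",       # hypothetical buckets
    "s3://example-aggregates/2015/10/01/00/",
    "s3://example-code/mapper.py",
)
```

With AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**request)`. Building the arguments separately keeps the cluster definition easy to inspect and test without launching anything.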

Learnings

Learnings: Resource Alignment

Wide transient clusters instead of static clusters reduced our cost significantly and allowed massive scale

                Static Hadoop                EMR on Demand

                m4.4xlarge instance          Multiple jobs a day

Usage           Min 20% (~4.8 hours a day)   ~4 hours a day

Monthly cost    $800 a month                 $170 a month
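The figures in the comparison above can be checked with a little arithmetic. The sketch below uses only the numbers from the slide; the variable names are made up for illustration, and the idle fraction assumes the static cluster really was busy only 20% of the time.

```python
# Numbers taken from the slide; names are illustrative.
STATIC_MONTHLY_USD = 800      # always-on static Hadoop cluster
ON_DEMAND_MONTHLY_USD = 170   # transient EMR clusters, several jobs a day
MIN_UTILIZATION = 0.20        # static cluster's minimum utilization

# 20% of a 24-hour day is the "~4.8 hours" quoted for the static cluster.
busy_hours_per_day = round(MIN_UTILIZATION * 24, 1)

# Share of the static spend that paid for an idle cluster at that utilization.
idle_fraction = 1 - MIN_UTILIZATION

# Relative saving from moving to on-demand clusters.
savings = 1 - ON_DEMAND_MONTHLY_USD / STATIC_MONTHLY_USD

print(f"busy hours/day: {busy_hours_per_day}")
print(f"idle share of static spend: {idle_fraction:.0%}")
print(f"monthly saving: {savings:.0%}")
```

The saving of roughly 79% is in line with the 80% cost reduction reported in the results.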

Learnings: Right Tool for the Job

With Amazon Elastic MapReduce we were able to process 90% of the analytics in a single pass through the whole dataset

40% performance improvement

EMR for performance
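The single-pass idea can be illustrated in plain Python: instead of scanning the dataset once per report, several aggregates are updated together while each record is read exactly once. The record layout (client, cost, CPU utilization) and the chosen metrics are hypothetical; the talk does not describe the actual analytics.

```python
from collections import defaultdict


def single_pass_metrics(records):
    """Compute several per-client aggregates in one scan of the records.

    Each record is a (client_id, cost, cpu_utilization) tuple; this
    layout is an assumption made for the example.
    """
    total_cost = defaultdict(float)
    peak_cpu = defaultdict(float)
    record_count = defaultdict(int)

    for client, cost, cpu in records:      # the only pass over the data
        total_cost[client] += cost
        peak_cpu[client] = max(peak_cpu[client], cpu)
        record_count[client] += 1

    return dict(total_cost), dict(peak_cpu), dict(record_count)


sample = [("a", 1.5, 0.2), ("a", 2.5, 0.9), ("b", 1.0, 0.5)]
costs, peaks, counts = single_pass_metrics(sample)
```

In a MapReduce job the same effect comes from emitting all metrics from one mapper and combining them in one reducer, rather than running a separate job per metric.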

Learnings: Data Management

• Utilize Hadoop merge instead of lookups

• Pipeline Hadoop to normalize data before processing
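"Merge instead of lookups" can be sketched as a sort-merge join: both datasets are sorted by key and joined in one sweep, instead of issuing a data store lookup for every record. The example below imitates a reduce-side join in plain Python; the datasets (instance metadata and usage records) and the function name are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(metadata, usage):
    """Inner-join two lists of (key, value) pairs with a single sorted sweep.

    Metadata rows are tagged 0 and usage rows 1, so that within each key
    the metadata arrives first, as it would in a reduce-side join.
    """
    tagged = [(k, 0, v) for k, v in metadata] + [(k, 1, v) for k, v in usage]
    tagged.sort(key=itemgetter(0, 1))      # stands in for the shuffle/sort

    for key, group in groupby(tagged, key=itemgetter(0)):
        names = []
        for _, tag, value in group:
            if tag == 0:
                names.append(value)        # remember metadata for this key
            else:
                for name in names:         # usage rows with no metadata are dropped
                    yield key, name, value


metadata = [("i-1", "web"), ("i-2", "db")]
usage = [("i-2", 5), ("i-1", 3), ("i-1", 4), ("i-3", 9)]
joined = list(merge_join(metadata, usage))
```

Each dataset is read once regardless of size, whereas per-record lookups put one query on the data store for every usage row.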

Results

• Processing reduction 75%

• Cost reduction 80%

• Improved maintainability

Testing in a Cloud Hadoop World

Approach

• Old method of testing is incomplete

• Full data size needed to validate complex analytics

Best practices emerged

• Scripts and tooling developed to rapidly create environments

• Different strategies and approaches to validate changes
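One possible validation strategy, sketched here as an assumption rather than the speakers' actual tooling, is to run the current and the changed job against the same full-size input and compare the aggregated outputs key by key. The function name and the dictionary-shaped outputs are hypothetical.

```python
import math


def diff_outputs(baseline, candidate, rel_tol=1e-9):
    """Compare two {key: numeric value} job outputs and list the differences."""
    problems = []
    for key in sorted(baseline.keys() | candidate.keys()):
        if key not in candidate:
            problems.append((key, "missing in candidate"))
        elif key not in baseline:
            problems.append((key, "unexpected in candidate"))
        elif not math.isclose(baseline[key], candidate[key], rel_tol=rel_tol):
            problems.append((key, "value changed"))
    return problems


baseline = {"a": 1.0, "b": 2.0}
candidate = {"a": 1.0, "b": 2.5, "c": 3.0}
report = diff_outputs(baseline, candidate)
```

An empty report means the change preserved the results. A tolerance is used because floating-point sums can differ slightly when records are processed in a different order.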

Organizational Learnings

• Architecture drives adoption

• Early pioneers lead the charge

• Adoption is more complex than traditional stacks

• Ramp-up of teams is much slower

• Amazon EMR is very effective for rapid prototyping

Future of Analytics

Where Do We See Analytics Going?

Ecosystem alignment

• The Hadoop world is in a tug of war between vendors and tools

• BI vendors and platform providers will consolidate to a small few, while open-source competitors and startups explode

Where Do We See Analytics Going?

Key areas of pain will be resolved

• Current challenges will get better

• Job management

• Data reporting

• Log consolidation

Where Do We See Analytics Going?

Managed service providers

• Experience in handling broad sets of data from a large client base will continue to enable MSPs such as Datapipe to build expertise and evolve consulting capabilities

Thank you!

Questions

Remember to complete your evaluations!