Peter-Mark Verwoerd Big Data on AWS Solutions Architect

Big Data on AWS - Meetupfiles.meetup.com/11363042/big_data_meetup.pdf · GB TB PB ZB EB The World is Producing Ever-Larger Volumes of Big Data •IT/ Application server logs IT Infrastructure


Peter-Mark Verwoerd

Big Data on AWS

Solutions Architect


What to get out of this talk

• Non-technical:

– Big Data processing stages: ingest, store, process, visualize

– Hot vs. Cold data

– Low latency processing vs. high latency processing

• Technical:

– Concepts above

– Big Data reference architectures and design patterns


The World is Producing Ever-Larger Volumes of Big Data

GB → TB → PB → EB → ZB

• IT / application server logs: IT infrastructure logs, metering, audit logs, change logs

• Web sites / mobile apps / ads: clickstream, user engagement

• Sensor data: weather, smart grids, wearables

• Social media, user content: 450MM+ tweets/day


Big Data: Best Served Fresh

Big Data

• Hourly server logs: how your systems were misbehaving an hour ago

• Weekly / monthly bill: what you spent this past billing cycle

• Daily customer-preferences report from your web site's click stream: tells you what deal or ad to try next time

• Daily fraud reports: tells you if there was fraud yesterday

Real-time Big Data

• CloudWatch metrics: what just went wrong now

• Real-time spending alerts/caps: guaranteeing you can't overspend

• Real-time analysis: tells you what to offer the current customer now

• Real-time detection: blocks fraudulent use now


The Challenge

Data, Big Data, Real-time Big Data = Plethora of tools


The Zoo

Apache Kafka, Amazon Kinesis, Apache Flume, Storm, Apache Spark, Apache Spark Streaming, Hadoop/EMR, Redshift, S3, DynamoDB, Hive, Pig, Shark, HDFS, Impala

?


Partners

Flume, Sqoop

HParser


Simplify

Data → Ingest → Store → Process → Visualize → Answers

• Ingest: Kinesis, Flume, Scribe, Kafka

• Store: HDFS, DynamoDB, Redshift, S3

• Process: Storm, Spark, Shark, Spark Streaming, Hive/Pig, Hadoop/EMR

• Visualize: Jaspersoft, Tableau


Ingest

Data → Ingest


Ingest

• The act of collecting and storing data



Why Data Ingest Tools?

• Collect random and high velocity data

– Many different sources

– High TPS (transactions per second)

• Collecting random and high velocity data is a challenging task

– Hard to durably store data at scale

– Hard to keep highly available

– Hard to scale


Why Data Ingest Tools?

• Data ingest tools convert random streams of data into a smaller set of sequential streams

– Sequential streams are easier to process

– Easier to scale

– Easier to persist
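One way to picture the conversion from many random sources into a few sequential streams is hash partitioning. The sketch below is a toy illustration under stated assumptions, not how Kafka or Kinesis is implemented: `NUM_SHARDS`, `shard_for`, and `ingest` are hypothetical names, and in-memory lists stand in for durable logs.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical: the "smaller set" of sequential streams


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a source key to a shard with a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def ingest(events, num_shards: int = NUM_SHARDS):
    """Append each (source_id, payload) event to its shard in arrival order."""
    shards = defaultdict(list)
    for source_id, payload in events:
        shard = shard_for(source_id, num_shards)
        sequence_number = len(shards[shard])  # position within the shard
        shards[shard].append((sequence_number, source_id, payload))
    return dict(shards)


# Events arrive interleaved from many sources.
events = [
    ("web-1", "click"),
    ("sensor-9", "21.5C"),
    ("web-1", "scroll"),
    ("app-3", "login"),
    ("web-1", "purchase"),
]
streams = ingest(events)
for shard, records in sorted(streams.items()):
    print(shard, records)
```

Each shard is an append-only sequence, so a consumer can read it front to back, and all events from one source stay in their original order.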

Kafka or Kinesis → Processing


Data Ingest Tools

• Facebook Scribe: Data collectors

• Amazon Kinesis: Data collectors

• Apache Kafka: Data collectors

• Apache Flume: Data Movement and Transformation


Partners – Data Load and Transformation

Big Data Edition

Flume, Sqoop

HParser


Storage

Data → Ingest → Store


Storage

Structured – Complex Query

• SQL

– Amazon RDS (MySQL, Oracle, SQL Server, Postgres)

• Data Warehouse

– Amazon Redshift

• Search

– Amazon CloudSearch

Unstructured – Custom Query

• Hadoop/HDFS

– Amazon Elastic MapReduce (EMR)

Structured – Simple Query

• NoSQL

– Amazon DynamoDB

• Cache

– Amazon ElastiCache (Memcached, Redis)

Unstructured – No Query

• Cloud Storage

– Amazon S3

– Amazon Glacier
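The four quadrants on this slide can be read as a lookup from (structure, query style) to candidate services. A minimal sketch, assuming the hypothetical names `STORAGE_OPTIONS` and `storage_candidates`; the service lists are copied from the slide:

```python
# Quadrants from the slide: (structure, query style) -> candidate services.
STORAGE_OPTIONS = {
    ("structured", "complex"): ["Amazon RDS", "Amazon Redshift", "Amazon CloudSearch"],
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("unstructured", "custom"): ["Amazon Elastic MapReduce (EMR)"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
}


def storage_candidates(structure: str, query: str) -> list:
    """Return the services listed for a quadrant, or raise if it is not on the slide."""
    key = (structure.lower(), query.lower())
    if key not in STORAGE_OPTIONS:
        raise ValueError(f"no quadrant for structure={structure!r}, query={query!r}")
    return STORAGE_OPTIONS[key]


print(storage_candidates("structured", "simple"))
print(storage_candidates("Unstructured", "None"))
```

The lookup only narrows the field; choosing within a quadrant still depends on latency, volume, and cost.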


[Figure: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon EMR, Amazon S3, and Amazon Glacier plotted on two axes]

• One axis: Request rate (High to Low), Cost/GB (High to Low), Latency (Low to High), Data Volume (Low to High)

• Other axis: Structure (Low to High)


| Service | Average latency | Data volume | Item size | Request rate | Cost ($/GB/month) | Durability |
|---|---|---|---|---|---|---|
| ElastiCache | ms | GB | B–KB | Very High | $$ | Low–Moderate |
| Amazon DynamoDB | ms | GB–TBs (no limit) | KB (64 KB max) | Very High | ¢¢ | Very High |
| Amazon RDS | ms, sec | GB–TB (3 TB max) | KB (~row size) | High | ¢¢ | High |
| CloudSearch | ms, sec | GB–TB | KB (1 MB max) | High | $ | High |
| Amazon Redshift | sec, min | TB–PB (1.6 PB max) | KB (64 K max) | Low | ¢ | High |
| Amazon EMR (Hive) | sec, min, hrs | GB–PB (~nodes) | KB–MB | Low | ¢ | High |
| Amazon S3 | ms, sec, min (~size) | GB–PB (no limit) | KB–GB (5 TB max) | Low–Very High (no limit) | ¢ | Very High |
| Amazon Glacier | hrs | GB–PB (no limit) | GB (40 TB max) | Very Low (no limit) | ¢ | Very High |


Process

Data → Ingest → Store → Process


Process

• Answering questions about data

• Questions

– Analytics: Think SQL/Data warehouse

– Classification: Think Sentiment Analysis

– Prediction: Think page-view prediction

– Etc


Processing Frameworks

• Generally come in two major types

– Batch processing

– Stream processing


Processing Frameworks

• Batch Processing

– Take a large amount (>100 TB) of cold data and ask questions

– Takes hours to get answers back

Example: Generating Monthly AWS Billing Reports


Processing Frameworks

• Stream Processing (a.k.a. real-time)

– Take a small amount of hot data and ask questions

– Takes a short amount of time to get your answer back

Example: CloudWatch 1-minute metrics
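The difference between batch and stream processing can be sketched with the same question, a per-minute average, asked both ways. This is a toy illustration under stated assumptions: the function and class names are hypothetical, timestamps are integer seconds, and nothing here reflects how CloudWatch, Hadoop, or Storm compute metrics internally.

```python
from collections import defaultdict


def batch_average_per_minute(records):
    """Batch: scan the complete (cold) data set, then report all answers at once."""
    totals = defaultdict(lambda: [0.0, 0])  # minute -> [sum, count]
    for timestamp, value in records:
        bucket = totals[timestamp // 60]
        bucket[0] += value
        bucket[1] += 1
    return {minute: s / n for minute, (s, n) in totals.items()}


class StreamingMinuteAverage:
    """Stream: update a small running state per (hot) record and answer immediately."""

    def __init__(self):
        self._totals = defaultdict(lambda: [0.0, 0])

    def update(self, timestamp, value):
        bucket = self._totals[timestamp // 60]
        bucket[0] += value
        bucket[1] += 1
        return bucket[0] / bucket[1]  # current average for this record's minute

    def snapshot(self):
        return {minute: s / n for minute, (s, n) in self._totals.items()}


# (timestamp in seconds, metric value)
records = [(0, 10.0), (30, 20.0), (61, 40.0), (90, 60.0), (125, 5.0)]

batch_result = batch_average_per_minute(records)

stream = StreamingMinuteAverage()
running = [stream.update(ts, value) for ts, value in records]

print(batch_result)  # {0: 15.0, 1: 50.0, 2: 5.0}
print(running)       # [10.0, 15.0, 40.0, 50.0, 5.0]
```

Both approaches converge on the same per-minute averages; the streaming version simply has a usable answer after every record instead of only after the full scan.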


Processing Frameworks

• Hadoop/EMR: Batch Processing

• Spark: Batch Processing

• Spark Streaming: Stream Processing

• Storm: Stream Processing

• Redshift: Batch Processing


Partners – Advanced Analytics

Impala


Visualize

Data → Ingest → Store → Process → Visualize


Activities of Data Visualization Users

• Which country consumes the most oil?

• What countries are oil exporters?

• Is there a trend of increasing oil consumption over time?

• Order countries by oil consumption/production?

• Is there a cluster of oil producers?

• What is the oil consumption of USA per day?

• What is the average oil consumption per day of Europe?

• Are there any outliers?

• What is the range of oil production?

• What is the distribution of oil producing countries?


Partners – BI & Data Visualization


Putting it all together (coupled architecture)

• Ingest/Store and processing tightly coupled

• Examples:

– S3 + EMR/Hadoop

– HDFS + EMR/Hadoop

– S3 + Redshift


Putting it all together (coupled architecture)

• Coupled systems provide less flexibility

– Cold data vs. Hot

– High latency processing vs. Low latency processing

• Example

– EMR+HDFS/S3

• Cold: Can handle processing 100 records/sec

• Hot: processing 1,000,000 records/sec?

– Redshift + S3

• High latency: Generate reports once a day

• Low latency: Generate reports every minute


Putting it all together (de-coupled architecture)

• Multi-tier data processing architecture

– Similar to multi-tier web-application architectures

• Ingest & Store de-coupled from Processing

– Concept of “databus”

Data → Databus → Process → Answers


Putting it all together (de-coupled architecture)

• Ingest tools write to multiple data stores within the “databus”

• Processing frameworks (Hadoop, Spark, etc.) consume from the “databus”

• Consumers can decide which data store to read from depending on their data processing requirements

Data → Ingest/Store (Kafka, S3, HDFS) → Process → Answers
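The de-coupled "databus" idea can be sketched as one ingest path that fans out to several stores, with each consumer picking the store that fits its needs. This is a toy sketch under stated assumptions: `Databus` and `choose_store` are hypothetical names, in-memory lists stand in for Kafka, S3, and HDFS, and the policy of sending low-latency consumers to the Kafka stand-in is an illustrative assumption rather than something the slides prescribe.

```python
from typing import Dict, List, Optional


class Databus:
    """Toy databus: every ingested record is written to all backing stores."""

    def __init__(self, store_names: List[str]) -> None:
        # In-memory lists stand in for the real stores (assumption).
        self.stores: Dict[str, List[dict]] = {name: [] for name in store_names}

    def ingest(self, record: dict) -> None:
        for records in self.stores.values():
            records.append(record)

    def read(self, store_name: str, last_n: Optional[int] = None) -> List[dict]:
        records = self.stores[store_name]
        if last_n is None:
            return list(records)
        return records[max(len(records) - last_n, 0):]


def choose_store(latency: str) -> str:
    """Hypothetical policy: low-latency consumers read the stream, others the archive."""
    return "kafka" if latency == "low" else "s3"


bus = Databus(["kafka", "s3", "hdfs"])
for i in range(5):
    bus.ingest({"id": i, "clicks": i * 10})

# A stream-style consumer only looks at the newest records.
recent = bus.read(choose_store("low"), last_n=2)

# A batch-style consumer scans everything that has been stored.
total_clicks = sum(r["clicks"] for r in bus.read(choose_store("high")))

print(recent)        # [{'id': 3, 'clicks': 30}, {'id': 4, 'clicks': 40}]
print(total_clicks)  # 100
```

Because ingest never needs to know who will read the data, a new processing framework can be added later by pointing it at whichever store suits its latency requirement.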


Data temperature & processing latency


Pattern 1: Redshift (cold & high)


Pattern 2: DynamoDB (warm and low)


Pattern 3: Hadoop (cold and high)


Pattern 4: Hadoop (warm and low)


Pattern 5: Spark (cold and low)


Pattern 6: Stream Processing (hot and low)
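Reading each pattern title as "(data temperature, processing latency)", the six patterns can be collected into a small lookup. A minimal sketch, assuming the hypothetical names `Pattern`, `PATTERNS`, and `patterns_for`; the entries are copied from the pattern titles:

```python
from typing import List, NamedTuple


class Pattern(NamedTuple):
    number: int
    technology: str
    temperature: str  # data temperature: cold, warm, or hot
    latency: str      # processing latency: high or low


# Entries taken from the six pattern slide titles.
PATTERNS = [
    Pattern(1, "Redshift", "cold", "high"),
    Pattern(2, "DynamoDB", "warm", "low"),
    Pattern(3, "Hadoop", "cold", "high"),
    Pattern(4, "Hadoop", "warm", "low"),
    Pattern(5, "Spark", "cold", "low"),
    Pattern(6, "Stream Processing", "hot", "low"),
]


def patterns_for(temperature: str, latency: str) -> List[Pattern]:
    """Return every pattern matching the given data temperature and latency."""
    return [
        p for p in PATTERNS
        if p.temperature == temperature and p.latency == latency
    ]


for p in patterns_for("cold", "high"):
    print(f"Pattern {p.number}: {p.technology}")
```

An empty result, such as for hot data with high-latency processing, means none of the six patterns covers that combination.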


Putting it All Together


What to get out of this talk

• Non-technical:

– Big Data processing stages: ingest, store, process, visualize

– Hot vs. Cold data

– Low latency processing vs. high latency processing

• Technical:

– Concepts above

– Big Data reference architectures and design patterns


Questions?