© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Gaurav Kumar, Christian Lam, David Winters November 30, 2016 Taking Data to the Extreme MBL202

AWS re:Invent 2016: Taking Data to the Extreme (MBL202)



David Winters

Big Data Architect, Data Science & Engineering, GoPro

Gaurav Kumar

Product Lead, Data Science & Engineering, GoPro

Christian Lam

Analytics Engineer, Data Science & Engineering, GoPro

Growing data needs

Origin Story

•Make Friends

•Haul Ass

•Maintain Balance

•No Half-Assery

•Integrity. Always

•Be a HERO

Yes, this comes from the top…

High-Level Architecture

ETL Cluster

• Aggregations and Joins

• Hive

• Map/Reduce

Secure Data Mart Cluster

• End User Query

• Impala/Sentry

• Parquet

Analytics Apps

•Hue

•Tableau

•Python

•R

Streaming Ingest Cluster

•Log file streaming

•RESTful service

•Kafka

•Spark Streaming

•HBase

Batch Induction Framework

• Batch files

• Scheduled downloads

• Pre-processing

• Java App

Original Cluster

JSON

JSON

Parquet

DDL

Data Pipeline

Streaming Ingest Cluster

ELB

HTTP

Pipeline for processing of streaming logs

To ETL Cluster

Data Pipeline

/path1/…

/path2/…

/path3/…

To ETL Cluster

/path4/…

Data Pipeline

ETL Cluster

HDFS

Hive Metastore

To SDM Cluster

From Streaming Ingest Cluster

Batch Induction Framework

Data Delivery!

HDFS

Hive Metastore

Applications

Thrift

ODBC

Server

User

Studio

Studio - Staging

GDA

Report

SDM Cluster

From ETL Cluster

Areas for Improvement

• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters

• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters

Amazon S3 bucket

Future Architecture

Streaming Ingest Cluster

Batch Induction Framework

Hive Metastore

Ephemeral ETL Cluster

JSON

Parquet + DDL

Aggregates

Events + State

Ephemeral Data Mart Cluster #1

Ephemeral Data Mart Cluster #2

Ephemeral Data Mart Cluster #N

Visualization Categories

3 categories for all visualizations…

• Operations
• Pre ETL
• Analytics


Operations

Data Ops

Operations

Data ops dashboards allow us to monitor the health of data streams and detect anomalies, as well as Tableau Server itself.
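As an illustration of the kind of anomaly check such a dashboard might surface, here is a minimal sketch that flags outliers in a series of event counts using a z-score. The slides do not say which detection method is used, so the z-score rule, the function name find_anomalies, the threshold, and the sample counts are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(counts, threshold=3.0):
    """Return the indexes of counts whose z-score exceeds the threshold.

    Hypothetical helper: a simple z-score rule, not a method taken from
    the presentation.
    """
    if len(counts) < 2:
        return []
    mu = mean(counts)
    sigma = stdev(counts)  # sample standard deviation
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Made-up hourly event counts for one data stream; hour 6 drops sharply.
hourly_events = [1000, 1020, 980, 1010, 995, 1005, 120, 990]
anomalous_hours = find_anomalies(hourly_events, threshold=2.0)
print(anomalous_hours)
```

With a series this short, a single outlier inflates the standard deviation, so a lower threshold (2.0 here) is needed for the drop to register. A production check would more likely compare against a rolling baseline.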

Operations


PRE BUILT AGGREGATIONS

• Aggregates are important for the successful adoption of Tableau

• Example: Karma flight table
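To make the idea of a pre-built aggregate concrete, the sketch below rolls raw flight events up into one row per day, so that a dashboard can read a small summary table instead of scanning every event. The schema of the Karma flight table is not shown in the slides. The field names (start_time, duration_seconds), the function name build_daily_aggregates, and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_daily_aggregates(events):
    """Roll raw events up into one summary row per calendar day.

    Hypothetical schema: each event has an ISO-8601 'start_time' string
    and an integer 'duration_seconds'.
    """
    totals = defaultdict(lambda: {"flights": 0, "total_seconds": 0})
    for event in events:
        day = event["start_time"][:10]  # 'YYYY-MM-DD' prefix of the timestamp
        row = totals[day]
        row["flights"] += 1
        row["total_seconds"] += event["duration_seconds"]
    # Sort by day so the output is stable and easy to load into a table.
    return {day: dict(row) for day, row in sorted(totals.items())}


# Made-up sample events for illustration only.
sample_events = [
    {"start_time": "2016-11-01T09:15:00", "duration_seconds": 600},
    {"start_time": "2016-11-01T17:40:00", "duration_seconds": 300},
    {"start_time": "2016-11-02T08:05:00", "duration_seconds": 450},
]
daily = build_daily_aggregates(sample_events)
print(daily)
```

In the architecture described earlier, this kind of roll-up would run as a Hive job on the ETL cluster. The Python version only shows the shape of the computation.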

Operations

FEED PRODUCT

INSIGHTS

• Product insights can result in product design changes

• How we observed Camera as a Hub upload behaviors

Analytics

Reporting

Analytics

CAMERA

CONNECTS…

• Helps us identify stolen/smuggled cameras

• Shows where we should put our marketing dollars for new products

DASHBOARDS (GoPro Plus SUBSCRIPTIONS)

• Real-time subscriber growth

• Utilizes Hive External Tables and JSON SerDe
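A Hive external table with a JSON SerDe lets raw JSON files be queried in place, before any ETL has run. The sketch below generates DDL of that kind. The slides do not show the actual DDL or name the SerDe class, so the use of org.apache.hive.hcatalog.data.JsonSerDe (the SerDe shipped with Hive's HCatalog) is an assumption, and the table name, columns, and location are hypothetical.

```python
def external_json_table_ddl(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for JSON files.

    Hypothetical helper: 'columns' is a list of (name, hive_type) pairs.
    Assumes the HCatalog JSON SerDe is available on the cluster.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Illustrative names only; none of these come from the presentation.
ddl = external_json_table_ddl(
    "subscription_events_raw",
    [("user_id", "STRING"), ("event_type", "STRING"), ("event_ts", "STRING")],
    "hdfs:///data/raw/subscription_events",
)
print(ddl)
```

Because the table is external, dropping it removes only the metastore entry and leaves the underlying JSON files untouched.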

Analytics

External Tables

Pre ETL


RC Playbook: Your guide to success at GoPro

Questions?

Thank you!

Remember to complete your evaluations!