Harness the power of your data: Build a next generation data platform on AWS Raghu Prabhu Global Manager, Business Development for Data Lakes

Page 1: Harness the power of your data

Harness the power of your data: Build a next generation data platform on AWS

Raghu Prabhu, Global Manager, Business Development for Data Lakes

Page 2: Harness the power of your data

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Agenda

• Why do you need a next generation data platform?

• Why should you build on AWS?

• Business Outcomes & Sample Reference Architectures

• Vanguard

• Epic Games

• Asurion

• Salesforce

Page 3: Harness the power of your data

Why do you need a next generation data platform?

Page 4: Harness the power of your data

Companies want more value from their data

Data is:

Growing exponentially

From new sources

Increasingly diverse

Used by many people

Analyzed by many applications

Complications:

Siloed approaches don’t work anymore

It’s too expensive and limiting to store data on-premises

Implication:

A new approach is needed to extract insights and value

Page 5: Harness the power of your data

Traditionally, analytics looked like this

Relational data

GBs-TBs scale [not designed for PB/EBs]

Expensive: Large initial capex + $10K-$50K/TB/year

90% of data was thrown away because of cost

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence
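
To put the $10K-$50K/TB/year figure in perspective, the range can be scaled to larger volumes. The sketch below is illustrative arithmetic only: it assumes decimal units (1 PB = 1,000 TB), leaves out the initial capex because the slide gives no figure for it, and the function name is invented for this example.

```python
from typing import Tuple

# Running-cost range quoted on the slide for traditional analytics.
LOW_USD_PER_TB_YEAR = 10_000
HIGH_USD_PER_TB_YEAR = 50_000
TB_PER_PB = 1_000  # assumption: decimal units, the slide does not specify


def annual_cost_range_usd(terabytes: float) -> Tuple[float, float]:
    """Return the (low, high) yearly running cost for a data volume in TB."""
    return terabytes * LOW_USD_PER_TB_YEAR, terabytes * HIGH_USD_PER_TB_YEAR


if __name__ == "__main__":
    for label, terabytes in [("10 TB", 10), ("1 PB", TB_PER_PB)]:
        low, high = annual_cost_range_usd(terabytes)
        print(f"{label}: ${low:,.0f} to ${high:,.0f} per year")
```

At petabyte scale the range reaches tens of millions of dollars per year, which is consistent with the slide's point that most data was discarded because of cost.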

Page 6: Harness the power of your data

Cloud data lakes are the future

Customers want:

Data Lake

Page 7: Harness the power of your data

Data demands are driving next generation architectures for analytics and innovation

Structured data: Data that are highly normalized with a common schema and stored in relational databases, powering transactional line-of-business applications

ERP CRM

LOB applications

Semistructured data: Data that contain identifiers without conforming to a predefined schema

Mobile Social

Sensors POS terminals

Unstructured data: Data that do not conform to a data model and are typically stored as individual files

Phone calls

Images

Videos Email

Batch load: Extracts data from various data sources at periodic intervals and moves them to the data lake

AWS Glue

Streaming: Ingests data that are generated from multiple sources such as log files, telemetry, mobile applications, and social networks

Amazon Kinesis

Amazon S3 data lake: Cloud-scale, centralized, and scalable architecture that enables enterprise data science

Amazon S3

Amazon Redshift

And data stored in the data lake can also be made directly searchable and queryable

Amazon Athena

Analytics: Data warehouses are repositories of normalized data and provide the foundational technology for BI

Amazon QuickSight

Amazon EMR

Amazon MSK

Machine Learning: Storing data in an Amazon S3 data lake enables customers to leverage predictive or prescriptive analytics; perform ad-hoc analyses; and use AI/ML for automation and efficiency

Amazon SageMaker

AWS Deep Learning AMIs

Amazon EMR
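
The difference between the batch and streaming ingestion paths on this slide can be pictured with a small in-memory buffer: records arrive one at a time and are written to the lake in groups. This is a generic sketch, not the Amazon Kinesis or AWS Glue API; the class name, the `sink` callback, and the record shape are all hypothetical.

```python
from typing import Callable, Dict, List


class MicroBatchBuffer:
    """Collects streaming records and hands them to a sink in groups."""

    def __init__(self, max_records: int, sink: Callable[[List[Dict]], None]) -> None:
        if max_records < 1:
            raise ValueError("max_records must be at least 1")
        self.max_records = max_records
        self.sink = sink  # hypothetical writer, e.g. "store one object in the lake"
        self._pending: List[Dict] = []

    def add(self, record: Dict) -> None:
        """Buffer one record and flush automatically when the buffer is full."""
        self._pending.append(record)
        if len(self._pending) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        """Send any buffered records to the sink and start a fresh buffer."""
        if self._pending:
            self.sink(self._pending)
            self._pending = []


if __name__ == "__main__":
    written: List[List[Dict]] = []
    buffer = MicroBatchBuffer(max_records=3, sink=written.append)
    for sequence in range(7):
        buffer.add({"event_id": sequence})
    buffer.flush()  # a periodic batch load would call this on a schedule
    print([len(batch) for batch in written])
```

A periodic batch load corresponds to calling `flush()` on a schedule, while a streaming path flushes as soon as enough records have accumulated.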

Page 8: Harness the power of your data

On premises data

Web app data

Amazon RDS

Other databases

Streaming data

AWS Glue Data Catalog

AWS Glue CrawlerAmazon S3

Amazon Redshift Spectrum

AWS Glue ETL

Amazon Athena

Amazon EMR

Amazon QuickSight

Amazon SageMaker

AWS Lake Formation

Goal #1: Security and governance layer

Goal #2: Manage S3 permissions for Analytics

Goal #3: Help build easy data ingestion pipelines

Your data Analytics and ML

Data Lakes on AWS

Page 9: Harness the power of your data

Why should you build on AWS?

Page 10: Harness the power of your data

© 2020, Amazon Web Services, Inc. or its Affiliates.

Migration & Streaming Services

Infrastructure Data Catalog & ETL

Security & Management

Data Warehousing

Big Data Processing

Interactive Query

Operational Analytics

Real-time Analytics

Serverless

Data processing

Data movement

Analytics

Data lake infrastructure & management

Dashboards Predictive Analytics

Data, visualization, engagement, & machine learning

Digital User Engagement

Data

Page 11: Harness the power of your data

Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka

Analytics: Redshift | EMR (Spark & Hadoop) | Athena | Elasticsearch Service | Kinesis Data Analytics | AWS Glue (Spark & Python)

Data lake infrastructure & management: S3/Glacier | AWS Glue | Lake Formation

Data, visualization, engagement, & machine learning: QuickSight | SageMaker | Comprehend | Lex | Polly | Rekognition | Translate | Pinpoint | Data Exchange (NEW) | + many more

Page 12: Harness the power of your data

Most secure: Services for security and governance

Compliance

AWS Artifact

Amazon Inspector

AWS CloudHSM

Amazon Cognito

AWS CloudTrail

Security

Amazon GuardDuty

AWS Shield

AWS WAF

Amazon Macie

VPC

Encryption

AWS Certificate Manager

AWS Key Management Service

Encryption at rest

Encryption in transit

Bring your own keys, HSM support

Identity

AWS IAM

AWS SSO

Amazon Cloud Directory

AWS Directory Service

AWS Organizations

Customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake

Page 13: Harness the power of your data

Most secure — Certifications

Global

CSA: Cloud Security Alliance Controls

ISO 9001: Global Quality Standard

ISO 27001: Security Management Controls

ISO 27017: Cloud Specific Controls

ISO 27018: Personal Data Protection

PCI DSS Level 1: Payment Card Standards

SOC 1: Audit Controls Report

SOC 2: Security, Availability, & Confidentiality Report

SOC 3: General Controls Report

United States

CJIS: Criminal Justice Information Services

DoD SRG: DoD Data Processing

FedRAMP: Government Data Standards

FERPA: Educational Privacy Act

FIPS: Government Security Standards

FISMA: Federal Information Security Management

GxP: Quality Guidelines and Regulations

FFIEC: Financial Institutions Regulation

HIPAA: Protected Health Information

ITAR: International Arms Regulations

MPAA: Protected Media Content

NIST: National Institute of Standards and Technology

SEC Rule 17a-4(f): Financial Data Standards

VPAT/Section 508: Accountability Standards

Asia Pacific

FISC [Japan]: Financial Industry Information Systems

IRAP [Australia]: Australian Security Standards

K-ISMS [Korea]: Korean Information Security

MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard

My Number Act [Japan]: Personal Information Protection

Europe

C5 [Germany]: Operational Security Attestation

Cyber Essentials Plus [UK]: Cyber Threat Protection

G-Cloud [UK]: UK Government Standards

IT-Grundschutz [Germany]: Baseline Protection Methodology

Page 14: Harness the power of your data

Most cost effective: Decouple compute and storage, with a choice of pay-as-you-go (PAYG) analytics services

Storage

S3 tiers & intelligent tiering

From $0.023 per GB/mo to as low as $0.004 per GB/mo

Compute

Spot & reserved instances

Save up to 90% off on-demand prices

EMR

Autoscaling

57% less than on-premises, per IDC report

Redshift

Less than a tenth of the cost of traditional solutions.

Athena & QuickSight

Serverless: pay only for what is used
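
The storage and compute figures on this slide translate into simple arithmetic. The sketch below is a rough illustration under stated assumptions: decimal units (1 TB = 1,000 GB), the two per-GB prices exactly as quoted on the slide, and a hypothetical on-demand price of $1.00 per hour used only to show what "up to 90% off" means. Real bills depend on region, tier, requests, and data transfer, none of which are modeled here.

```python
# Per-GB monthly prices quoted on the slide (highest and lowest tier).
HIGH_TIER_USD_PER_GB_MONTH = 0.023
LOW_TIER_USD_PER_GB_MONTH = 0.004
GB_PER_TB = 1_000  # assumption: decimal units


def monthly_storage_cost_usd(terabytes: float, usd_per_gb_month: float) -> float:
    """Storage-only monthly cost for a data volume at a flat per-GB price."""
    return terabytes * GB_PER_TB * usd_per_gb_month


def discounted_price(on_demand_price: float, discount_fraction: float) -> float:
    """Price after a fractional discount, e.g. 0.9 for "90% off"."""
    if not 0.0 <= discount_fraction <= 1.0:
        raise ValueError("discount_fraction must be between 0 and 1")
    return on_demand_price * (1.0 - discount_fraction)


if __name__ == "__main__":
    volume_tb = 100  # hypothetical data lake size
    high = monthly_storage_cost_usd(volume_tb, HIGH_TIER_USD_PER_GB_MONTH)
    low = monthly_storage_cost_usd(volume_tb, LOW_TIER_USD_PER_GB_MONTH)
    print(f"{volume_tb} TB: ${high:,.0f}/month down to ${low:,.0f}/month")
    # Hypothetical $1.00/hour on-demand instance at the maximum quoted discount.
    print(f"Best-case spot price: ${discounted_price(1.00, 0.90):.2f}/hour")
```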

Page 15: Harness the power of your data

More data lakes and analytics than anywhere else

Tens of thousands of data lakes run on AWS across all industries

Page 16: Harness the power of your data

Business Outcomes & Sample Reference Architectures

Page 17: Harness the power of your data

Vanguard is the largest provider of mutual funds and the second-largest provider of exchange-traded funds, with $5.1 trillion in assets.

CHALLENGE: Lines of business (LoBs) are increasingly interested in analytics. IT needs to make ~1 PB of data actionable for fraud and investment fund analytics.

SOLUTION: Architect a cloud-based data lake solution with S3 and EMR. BI tools connect to Presto on EMR to democratize data.

RESULT: Empowered 150+ users with a ‘Ready for Analytics’ environment of curated data sets to realize operational efficiency, and reduced EC2 & EMR costs by $600k through a phased approach.

Page 18: Harness the power of your data

Databases

Files

Vendor

Data

On-premise AWS Cloud

Region

Data lake

Process R1

Process R2

Load raw

Bucket R1

Bucket R2

Bucket R3

Raw data

Process C1

Process C2

Cleansing

Bucket C1

Bucket C2

Bucket C3

Cleansed data

Process T1

Process T2

Transform

Bucket T1

Bucket T2

Bucket T3

Ready for analytics

Vanguard Reference Architecture

• Ingest 100+ data sources to S3 for ‘specific data domains.’

• Sqoop is used to bring data from databases, which is then stored in S3.

• Transient EMR clusters are used to clean raw data, transform data, and perform analytics.

• Data scientists run models via Jupyter & Spark on EMR using curated data sets, or go directly to the raw data in S3 as needed.

• A Hive metastore holds all the metadata about the data and tables in the cluster.

• SQL analysts use Presto, Hue, and Hive to perform ad hoc queries.

• LoB users (institutional, retail, international) access the curated data sets in EMR through Presto, connected to their Tableau Server.
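
The raw, cleansed, and ready-for-analytics zones in this architecture can be pictured as successive stages over the same records. The sketch below runs entirely in memory with plain Python: it does not use Sqoop, EMR, or S3, and the field names (`account_id`, `amount`), the cleansing rule, and the aggregation are hypothetical stand-ins for whatever Vanguard's actual jobs do.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]
REQUIRED_FIELDS = ("account_id", "amount")  # hypothetical schema


def cleanse(raw_records: List[Record]) -> List[Dict[str, str]]:
    """Raw -> cleansed: trim whitespace and drop records missing a required field."""
    cleansed = []
    for record in raw_records:
        values = {key: (record.get(key) or "").strip() for key in REQUIRED_FIELDS}
        if all(values.values()):
            cleansed.append(values)
    return cleansed


def transform(cleansed_records: List[Dict[str, str]]) -> Dict[str, float]:
    """Cleansed -> ready for analytics: total amount per account."""
    totals: Dict[str, float] = {}
    for record in cleansed_records:
        account = record["account_id"]
        totals[account] = totals.get(account, 0.0) + float(record["amount"])
    return totals


def run_pipeline(raw_records: List[Record]) -> Dict[str, object]:
    """Keep every zone, mirroring the separate buckets in the diagram."""
    zones: Dict[str, object] = {"raw": list(raw_records)}
    zones["cleansed"] = cleanse(raw_records)
    zones["ready_for_analytics"] = transform(zones["cleansed"])
    return zones


if __name__ == "__main__":
    sample = [
        {"account_id": " A1 ", "amount": "10.5"},
        {"account_id": None, "amount": "3"},
        {"account_id": "A1", "amount": " 4.5"},
    ]
    print(run_pipeline(sample)["ready_for_analytics"])
```

Keeping each zone's output, instead of overwriting it, is what lets data scientists fall back to the raw data while analysts query only the curated sets.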

Page 19: Harness the power of your data

Salesforce Marketing Cloud’s DMP captures, unifies, & activates data to strengthen relationships across every touchpoint

CHALLENGE: The DMP holds 40 PB of data, growing 4% week over week. Look-alike models need to run at scale.

SOLUTION: EMR for data science, batch workloads, and on-demand segmentation using Spark and MapReduce, running 3,000 clusters (mostly transient) daily. Results are pushed to S3, Redshift, and RDS.

RESULT: Optimized pipelines publish to hundreds of APIs.

Every 60 seconds, the DMP processes nearly:

• 4.3M user match requests

• 1.6M page views

• 8.75M data capture events

• 700k ad impressions
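
The per-60-second figures and the 4% week-over-week growth rate are easier to reason about once converted to other time scales. The sketch below is plain arithmetic on the slide's numbers; the function names are invented here, and the compound-growth projection assumes the 4% rate stays constant, which the slide does not claim.

```python
import math

# Counts the slide reports for each 60-second window.
EVENTS_PER_60_SECONDS = {
    "user match requests": 4_300_000,
    "page views": 1_600_000,
    "data capture events": 8_750_000,
    "ad impressions": 700_000,
}


def per_second(count_per_minute: float) -> float:
    """Average rate per second for a per-minute count."""
    return count_per_minute / 60


def per_day(count_per_minute: float) -> float:
    """Total per day if the per-minute count held all day."""
    return count_per_minute * 60 * 24


def size_after_weeks(start_size: float, weekly_growth: float, weeks: int) -> float:
    """Compound growth: size after a number of weeks at a constant weekly rate."""
    return start_size * (1 + weekly_growth) ** weeks


def weeks_to_double(weekly_growth: float) -> float:
    """Weeks needed for the data set to double at a constant weekly rate."""
    return math.log(2) / math.log(1 + weekly_growth)


if __name__ == "__main__":
    for name, count in EVENTS_PER_60_SECONDS.items():
        print(f"{name}: {per_second(count):,.0f}/s, {per_day(count):,.0f}/day")
    print(f"Doubling time at 4% per week: {weeks_to_double(0.04):.1f} weeks")
    print(f"40 PB after 52 weeks: {size_after_weeks(40, 0.04, 52):,.0f} PB")
```

At a constant 4% per week the data set would double roughly every 18 weeks, which helps explain the emphasis on transient clusters and decoupled storage.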

Page 20: Harness the power of your data

Users

Client Partners

Ad hoc Queries/Insights

Redshift Spectrum

External APIs

Page

App

Data Processing

Event (time-based)

On-Demand Segmentation

Data Science & Batch Workloads

NRT

Data Collection

Log Files

CRM

Data Ingestion (Batch)

Data Ingestion (Real Time Feed)

Page

App

• Data Collection. Salesforce routinely processes 40 PB of data aggregated from webpages, browsers, and mobile apps. This data is ingested into S3 via a real-time pipeline using Kafka.

• Data Processing. Data is ingested using Spark Streaming. Segmentation rules are kept in RDS. Using 3,000 transient EMR clusters with heavy dependency on Spot Instances and Instance Fleets, attributes are assigned and segmentation is completed. In real time, Salesforce can send this data to its partners to use in their targeted ads using look-alike models.

• Data Analytics & Activation. EMR pushes to Redshift, where insights are available to customers on the pages or apps for targeted ads to hundreds of partners.

Salesforce DMP Reference Architecture

With 3,000 transient clusters running at scale, Salesforce controlled costs using Spot Instances and Instance Fleets

Data Analytics & Activation
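
The segmentation step described for this architecture, where stored rules are applied to user attributes, can be sketched as a set of named predicates. This is a minimal in-memory illustration, not Salesforce's implementation: the rule names, the attribute keys, and the thresholds are all hypothetical, and a dictionary stands in for the rules the slide says are kept in RDS.

```python
from typing import Any, Callable, Dict, List

Attributes = Dict[str, Any]
Rule = Callable[[Attributes], bool]

# Hypothetical rules; a dictionary stands in for the rule store.
SEGMENT_RULES: Dict[str, Rule] = {
    "frequent_visitor": lambda attrs: attrs.get("page_views", 0) >= 10,
    "mobile_user": lambda attrs: attrs.get("device") == "mobile",
}


def assign_segments(attrs: Attributes, rules: Dict[str, Rule] = SEGMENT_RULES) -> List[str]:
    """Return the sorted names of every segment whose rule matches the user."""
    return sorted(name for name, rule in rules.items() if rule(attrs))


if __name__ == "__main__":
    users = {
        "u1": {"page_views": 12, "device": "mobile"},
        "u2": {"page_views": 2, "device": "desktop"},
    }
    for user_id, attrs in users.items():
        print(user_id, assign_segments(attrs))
```

Because each user is evaluated independently, work of this shape splits cleanly across many short-lived clusters, which fits the transient-cluster approach the slide describes.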

Page 21: Harness the power of your data

Fortnite is Epic Games' sensation, attracting more than 140M players and growing by 2 PB of data per month.

CHALLENGE: Fortnite is free-to-play, with its revenue coming entirely from in-game micro-transactions, so revenue depends on continuously capturing the attention of gamers through new content and continuous innovation. To operate this way, Epic needs up-to-the-minute insight into the gamer experience.

SOLUTION: Built an AWS data lake using Kinesis to feed telemetry data to S3, EMR for analytics, and DynamoDB for fast querying.

RESULT: Gained an up-to-the-minute understanding of gamer satisfaction to keep gamers engaged, resulting in the most popular game played in the world.

Page 22: Harness the power of your data

Near real-time pipeline

Batch pipelines

Grafana

Scoreboards API

Limited raw data, real-time ad-hoc SQL

Tableau/BI

Ad-hoc SQL

DynamoDB

Game clients

Game servers

Launcher

Game services

Kinesis

Spark on EMR

User ETL Metric definition

APIs

Other sources

S3

Databases

ETL using EMR

S3(Data lake)

• Entire analytics platform running on AWS.

• Amazon S3 leveraged as a data lake.

• All telemetry data is collected with Amazon Kinesis.

• A large EMR cluster handles the bulk of batch data processing, with Spark on EMR for real-time analytics.

• DynamoDB to create scoreboards and real-time queries.

• Game designers use data to inform their decisions including what to patch, which weapons to introduce or discontinue, and more.

Epic Games Reference Architecture

Epic Games uses the AWS analytics platform to power the Fortnite game for 140M+ players
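
The scoreboard path in this architecture, telemetry events aggregated into a fast key-value lookup, can be sketched with a dictionary keyed by player. This is an in-memory stand-in only: it does not call Kinesis or DynamoDB, and the event fields (`type`, `player`) and the `elimination` event name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_scoreboard(events: Iterable[dict]) -> Dict[str, int]:
    """Count elimination events per player, ignoring other telemetry."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        if event.get("type") == "elimination":
            totals[event["player"]] += 1
    return dict(totals)


def top_n(scoreboard: Dict[str, int], n: int) -> List[Tuple[str, int]]:
    """Highest scores first; ties are broken alphabetically by player name."""
    return sorted(scoreboard.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    telemetry = [
        {"type": "elimination", "player": "ana"},
        {"type": "elimination", "player": "bo"},
        {"type": "login", "player": "cy"},
        {"type": "elimination", "player": "ana"},
    ]
    print(top_n(build_scoreboard(telemetry), 2))
```

Keying the totals by player is what makes the real-time lookup cheap: reading one player's score is a single key access, with no need to scan the raw events.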

Page 23: Harness the power of your data

Asurion is a leader in device protection and customer support solutions for smartphones, tablets, consumer electronics, and appliances.

CHALLENGE: Need to analyze semi-structured and unstructured data sets from voice-to-text, claims data, and social media from >290M customers for customer behavior insights.

SOLUTION: Use S3 as a data lake to store all data in a single location, with EMR processing raw data that is then pushed to Redshift for fast BI.

RESULT: Achieved improved technical efficiencies and cost savings of 55% compared to on-demand pricing and 40% compared to Reserved Instances.
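
The two savings percentages in the result can be related to each other with a little arithmetic. The sketch below assumes both figures describe the same workload, which the slide does not state explicitly; under that assumption it derives what the Reserved Instance cost would have been as a fraction of the on-demand cost. The function names are invented for this example.

```python
def achieved_fraction(savings: float) -> float:
    """Cost actually paid, as a fraction of the baseline, given fractional savings."""
    return 1.0 - savings


def implied_reserved_fraction(savings_vs_on_demand: float, savings_vs_reserved: float) -> float:
    """Reserved cost as a fraction of on-demand, if both savings share one workload."""
    achieved = achieved_fraction(savings_vs_on_demand)
    return achieved / (1.0 - savings_vs_reserved)


if __name__ == "__main__":
    paid = achieved_fraction(0.55)
    reserved = implied_reserved_fraction(0.55, 0.40)
    print(f"Cost paid: {paid:.0%} of on-demand")
    print(f"Implied Reserved Instance cost: {reserved:.0%} of on-demand")
```

Under that assumption, the Reserved Instance baseline works out to about three quarters of the on-demand cost.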

Page 24: Harness the power of your data

• Asurion has a PB-scale platform collecting data from 20+ sources, including telephony, voice-to-text, claims data, and social media sites.

• Real-time data is streamed with Kinesis and Lambda.

• Raw data lands in S3, which serves as the central repository and data lake.

• EMR is used to process and curate data using Apache Hive, Apache Spark, and Presto.

• The data lake supports:

• 1,000 ad hoc queries/day

• 25+ Spark jobs

• 2 PB of data

• Redshift provides fast analytics for BI tools.

Data from other applications

Data from OLTP

Application Events & Logging

Domain applications

EMR

Data preparation: Process raw data into meaningful content

Access Virtualization (AD and User Profiles)

3rd party BI Tools

Orchestration – Jobs, Plans, Workflows –Enterprise Scheduler – Information Lifecycle Management

Real-time data

Redshift

DynamoDB

S3

Kinesis

Data Collection Services (ODBC/JDBC, CDC, Event Streaming, APIs)

Asurion Reference Architecture

Asurion completes customer behavior insights on 290M members using AWS data lake & analytics services

Page 25: Harness the power of your data

Thank You!