© 2020, Amazon Web Services, Inc. or its Affiliates. Immersion Day – Building a Data Lake on AWS

Immersion Day – Building a Data Lake on AWS

Workshop Agenda

• Overview of the Data Lake

• Hydrating the Data Lake

• Lab - Hydrating the Data Lake

• Working Within the Data Lake

• Lab - Transforming and Querying Data in the Data Lake

• Consuming the Data Lake – Reporting, Analytics, Machine Learning

• Lab – Extending Consuming Data Lake for Machine Learning

• Automating Data Lake with AWS Lake Formation

• Lab - Data Lake Automation

What is a Data Lake?

• A centralized repository for both structured and unstructured data

• Store data as-is in open-source file formats to enable direct analytics

Why a Data Lake?

• Decouple storage from compute, allowing you to scale

• Enable advanced analytics across all of your data sources

• Reduce complexity in ETL and operational overhead

• Future extensibility as new database and analytics technologies are invented

Traditionally, Analytics Looked Like This

OLTP ERP CRM LOB

Data Warehouse

Business Intelligence

TBs-PBs Scale

Schema Defined Prior to Data Load

Operational and Ad Hoc Reporting

Large Initial Capex + $$K / TB/ Year

Relational Data

Data Lakes Extend the Traditional Approach

OLTP ERP CRM LOB

Catalog

DW Queries

Big Data Processing

Interactive Real-Time

Web Sensors Social Devices

Business Intelligence

Machine Learning

TBs-EBs Scale

All Data in one place, a Single Source of Truth

Relational and Non-Relational Data

Decouples (low cost) Storage and Compute

Schema on Read

Diverse Analytical Engines

Data Lake

Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data, from all of your sources, in one centralized location.

“Why is the data distributed in many locations? Where is the single source of truth?”

Benefits of a Data Lake – Quick Ingest

Quickly ingest data without needing to force it into a pre-defined schema.

“How can I collect data quickly from various sources and store it efficiently?”

Benefits of a Data Lake – Storage vs Compute

Separating your storage and compute allows you to scale each component as required.

“How can I scale up with the volume of data being generated?”

Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
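The idea of applying schemas on read can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the sample events, the field names, and the `read_with_schema` helper are hypothetical and are not part of any AWS service or of the workshop labs. The point is only that the data is stored once, as-is, and each consumer decides at read time which fields and types it wants.

```python
import io
import json

# Hypothetical raw events, stored as-is (one JSON document per line).
# Note that the second record has no "device" field: nothing was enforced on write.
RAW_EVENTS = """\
{"user": "alice", "amount": "12.50", "ts": "2020-03-01T10:00:00", "device": "web"}
{"user": "bob", "amount": "7.25", "ts": "2020-03-01T11:30:00"}
"""


def read_with_schema(raw, schema):
    """Apply a schema (field name -> converter) while reading, not while writing."""
    rows = []
    for line in io.StringIO(raw):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Fields missing from a record become None; fields outside the schema are ignored.
        rows.append({
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        })
    return rows


# Two consumers read the same stored bytes with different (hypothetical) schemas.
billing_schema = {"user": str, "amount": float}
device_schema = {"user": str, "device": str}

billing_rows = read_with_schema(RAW_EVENTS, billing_schema)
device_rows = read_with_schema(RAW_EVENTS, device_schema)
```

Because no schema was imposed when the events were written, adding a third consumer with yet another view of the data would not require rewriting or reloading anything.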

Building a Data Lake on AWS

Why AWS?

Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases.

Typical steps for building a data lake

1 Set up storage

2 Move data

3 Cleanse, prep, and catalog data

4 Configure and enforce security and compliance policies

5 Make data available for analytics

Robust data lake infrastructure with AWS

S3

Lake Formation & AWS Glue

Snowball

Snowmobile

Kinesis Data Streams

Kinesis Data Firehose

Amazon Redshift

Amazon EMR

Athena

Kinesis

Amazon ES

Amazon SageMaker

Comprehend

Amazon Rekognition

Durable and available; exabyte scale

Secure, compliant, auditable

Object-level controls for fine-grained access

Fast performance by retrieving subsets of data

Decoupling of compute and storage

On-demand resources, tiering, cost choices

Why Amazon S3 for a Data Lake?

Durable

• Designed for 11 9s of durability

Available

• Designed for 99.99% availability

High performance

• Multipart upload

• Range GET

Scalable

• Store as much as you need

• Scale storage and compute independently

• No minimum usage commitments

Integrated

• Amazon EMR

• Amazon Redshift

• Amazon DynamoDB

• Amazon SageMaker

• Many more

Easy to use

• Simple REST API

• AWS SDKs

• Read-after-create consistency

• Event notification

• Lifecycle policies
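Range GET, listed on this slide under high performance, relies on the standard HTTP `Range` request header, whose byte offsets are inclusive at both ends. The sketch below only computes the header values needed to fetch an object in chunks; it does not call S3. The `byte_ranges` helper and the sizes used are hypothetical and chosen purely for illustration.

```python
def byte_ranges(object_size, chunk_size):
    """Return HTTP Range header values that together cover object_size bytes."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        # The end offset in a Range header is inclusive, hence the -1.
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


# A hypothetical 10-byte object read in 4-byte chunks.
example = byte_ranges(10, 4)
# example is ["bytes=0-3", "bytes=4-7", "bytes=8-9"]
```

Each value could be sent as the `Range` header of a separate GET request, so a reader can fetch only the part of an object it needs or fetch several parts in parallel.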

Processing & Analytics

Real-time Analytics

AI & Predictive

BI & Data Visualization

Transactional & RDBMS

AWS Lambda

Apache Storm on EMR

Apache Flink on EMR

Spark Streaming on EMR

Elasticsearch Service

Kinesis Data Analytics, Kinesis Data Streams

DynamoDB: NoSQL DB

Aurora: Relational Database

EMR: Hadoop, Spark, Presto

Redshift: Data Warehouse

Athena: Query Service

Amazon Lex: Speech recognition

Amazon Rekognition

Amazon Polly: Text to speech

Machine Learning: Predictive analytics

Kinesis Streams & Firehose

SageMaker

What can you do with a Data Lake?

Query Directly with Amazon Athena

Analyze with Hadoop on Amazon EMR

Create Visualizations with Amazon QuickSight

Train ML Models with Amazon SageMaker

Create a Central Data Catalog with AWS Glue

Load into Downstream Services

• Amazon Redshift: Run complex analytic queries against petabytes of structured data.

• Amazon DynamoDB: A NoSQL database service that delivers consistent, single-digit millisecond latency at any scale.

• Amazon Aurora: A MySQL and PostgreSQL compatible relational database built for the cloud.

• Amazon Elasticsearch: Delivers Elasticsearch’s real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.

• Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.

Today’s Focus

Central Storage: Scalable, secure, cost-effective

• S3

Catalog & Search

• AWS Glue

• Amazon DynamoDB

• Amazon Elasticsearch Service

Access & User Interfaces

• AWS AppSync

• Amazon API Gateway

• Amazon Cognito

Manage & Secure

• AWS KMS

• AWS CloudTrail

• AWS IAM

• Amazon CloudWatch

Data Ingestion

• AWS Snowball

• AWS Storage Gateway

• Amazon Kinesis Data Firehose

• AWS Direct Connect

• AWS Database Migration Service

• AWS DataSync

• AWS Transfer for SFTP

• Amazon S3 Transfer Acceleration

Analytics & Serving

• Amazon Athena

• Amazon EMR

• AWS Glue

• Amazon Redshift

• Amazon DynamoDB

• Amazon QuickSight

• Amazon Kinesis

• Amazon Elasticsearch Service

• Amazon Neptune

• Amazon RDS

Thank You!