© 2020, Amazon Web Services, Inc. or its Affiliates.
Immersion Day – Building a Data Lake on AWS
Workshop Agenda
• Overview of the Data Lake
• Hydrating the Data Lake
• Lab - Hydrating the Data Lake
• Working Within the Data Lake
• Lab - Transforming and Querying Data in the Data Lake
• Consuming the Data Lake – Reporting, Analytics, Machine Learning
• Lab – Extending Consuming Data Lake for Machine Learning
• Automating Data Lake with AWS Lake Formation
• Lab - Data Lake Automation
What is a Data Lake?
• A centralized repository for both structured and unstructured data
• Store data as-is in open-source file formats to enable direct analytics
Why a Data Lake?
• Decouple storage from compute, allowing you to scale each as required
• Enable advanced analytics across all of your data sources
• Reduce complexity in ETL and operational overhead
• Future extensibility as new database and analytics technologies are invented
Traditionally, Analytics Looked Like This
OLTP, ERP, CRM, LOB → Data Warehouse → Business Intelligence
• Relational Data
• TBs-PBs Scale
• Schema Defined Prior to Data Load
• Operational and Ad Hoc Reporting
• Large Initial Capex + $$K / TB / Year
Data Lakes Extend the Traditional Approach
Sources: OLTP, ERP, CRM, LOB, Web, Sensors, Social, Devices
Data Lake with Catalog: DW Queries, Big Data Processing, Interactive, Real-Time
Consumers: Business Intelligence, Machine Learning
• TB-EBs Scale
• All Data in one place, a Single Source of Truth
• Relational and Non-Relational Data
• Decouples (low cost) Storage and Compute
• Schema on Read
• Diverse Analytical Engines
Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data, from all of your sources, in one
centralized location.
“Why is the data distributed in many locations? Where is the
single source of truth?”
Benefits of a Data Lake – Quick Ingest
Quickly ingest data without needing to force it into a
pre-defined schema.
“How can I collect data quickly from various sources and store
it efficiently?”
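As a rough illustration of ingesting without a pre-defined schema, the sketch below serializes records exactly as received (JSON Lines) and builds a date-partitioned object key for them. The `raw/` prefix, the `source=`/`year=`/`month=`/`day=` key layout, and the helper names `landing_key` and `to_json_lines` are assumptions made for this example, not something the workshop prescribes. The upload itself (for example to Amazon S3) is left out.

```python
import json
from datetime import datetime, timezone


def landing_key(source: str, received_at: datetime, batch_id: str) -> str:
    """Build an object key for a raw batch (hypothetical key layout)."""
    return (
        f"raw/source={source}/"
        f"year={received_at:%Y}/month={received_at:%m}/day={received_at:%d}/"
        f"{batch_id}.jsonl"
    )


def to_json_lines(records: list) -> bytes:
    """Serialize records as received, one JSON document per line.

    No schema is enforced: records may carry different fields.
    """
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")


records = [
    {"device": "sensor-1", "temp_c": 21.5},
    # A record with a different shape is accepted unchanged.
    {"user": "alice", "action": "click", "page": "/home"},
]
key = landing_key("web", datetime(2020, 1, 15, tzinfo=timezone.utc), "batch-0001")
body = to_json_lines(records)
print(key)
print(body.decode("utf-8"))
```

Because nothing is validated or reshaped on the way in, a new field or a new source does not require a change to the ingest step; interpretation is deferred to whoever reads the data.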
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component as
required.
“How can I scale up with the volume of data being generated?”
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc analysis by applying schemas
on read, not write.
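Schema on read can be sketched in a few lines: the same stored records are interpreted through different schemas at query time. The sample records, the two schemas, and the `read_with_schema` helper are hypothetical and only illustrate the idea. Engines such as Athena or EMR apply their own table definitions to the files.

```python
import json

# Raw events stored as-is; no schema was enforced when they were written.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "country": "DE", "ts": "2020-01-15T10:00:00"}',
    '{"user": "bob", "amount": "7", "ts": "2020-01-15T11:30:00", "coupon": "NEW10"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time.

    Fields missing from a record become None; fields not in the schema are ignored.
    """
    rows = []
    for line in lines:
        raw = json.loads(line)
        rows.append({
            name: (convert(raw[name]) if name in raw else None)
            for name, convert in schema.items()
        })
    return rows


# Two consumers apply two different schemas to the same stored data.
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "country": str, "coupon": str}

billing_rows = read_with_schema(RAW_LINES, billing_schema)
marketing_rows = read_with_schema(RAW_LINES, marketing_schema)
print(billing_rows)
print(marketing_rows)
```

Neither consumer had to agree on a schema before the data was written, and adding a third view later does not require rewriting the stored records.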
Building a Data Lake on AWS
Why AWS?
Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases.
Typical steps for building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
Robust data lake infrastructure with AWS
Services: S3; Lake Formation & AWS Glue; Snowball; Snowmobile; Kinesis Data Streams; Kinesis Data Firehose; Amazon Redshift; Amazon EMR; Athena; Kinesis; Amazon ES; Amazon SageMaker; Comprehend; Amazon Rekognition
• Durable and available; exabyte scale
• Secure, compliant, auditable
• Object-level controls for fine-grained access
• Fast performance by retrieving subsets of data
• Decoupling of compute and storage
• On-demand resources, tiering, cost choices
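One common way to retrieve a subset of an object is an HTTP range request. The sketch below only computes the `Range` header values needed to cover an object in fixed-size parts. The helper name `byte_ranges` and the sizes used are illustrative assumptions, and issuing the requests is omitted.

```python
def byte_ranges(object_size: int, part_size: int) -> list:
    """Return HTTP Range header values covering an object of `object_size` bytes.

    HTTP byte ranges are inclusive on both ends, so the first 1024 bytes
    are requested as "bytes=0-1023".
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # The last part may be shorter than part_size.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(2500, 1024))
```

Each value can be sent as the `Range` header of a GET request. With boto3, for instance, `get_object` accepts it through its `Range` parameter, so a reader can fetch only the part of a large file it needs or fetch several parts in parallel.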
Why Amazon S3 for a Data Lake?
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon SageMaker
• Many more
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
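To put the durability and availability design targets in concrete terms, the arithmetic below converts them into expected yearly figures. It treats the durability figure as an independent per-object annual probability and uses a 365-day year; both are simplifying assumptions, and the function names are made up for this sketch.

```python
from fractions import Fraction

# Design targets quoted on the slide, held as exact fractions.
annual_loss_probability = Fraction(1, 10**11)  # 1 - 99.999999999% ("11 9s")
unavailability = Fraction(1, 10**4)            # 1 - 99.99%


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the stated assumption."""
    return object_count * annual_loss_probability


def expected_unavailable_minutes_per_year(days_per_year: int = 365) -> Fraction:
    """Minutes per year that fall outside the availability target."""
    return unavailability * days_per_year * 24 * 60


lost = expected_objects_lost_per_year(10_000_000)
print(f"expected objects lost per year: {float(lost)}")
print(f"about one object every {1 / lost} years")
print(f"unavailable minutes per year: {float(expected_unavailable_minutes_per_year())}")
```

Under these assumptions, storing ten million objects corresponds to an expected loss of one object roughly every 10,000 years, and 99.99% availability corresponds to a little under an hour of unavailability per year.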
Processing & Analytics
Categories: Real-time Analytics; AI & Predictive; BI & Data Visualization; Transactional & RDBMS
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Data Analytics, Kinesis Data Streams
• DynamoDB: NoSQL DB
• Aurora: Relational Database
• EMR: Hadoop, Spark, Presto
• Redshift: Data Warehouse
• Athena: Query Service
• Amazon Lex: Speech recognition
• Amazon Rekognition
• Amazon Polly: Text to speech
• Machine Learning: Predictive analytics
• Kinesis Streams & Firehose
• SageMaker
What can you do with a Data Lake?
• Query Directly with Amazon Athena
• Analyze with Hadoop on Amazon EMR
• Create Visualizations with Amazon QuickSight
• Train ML Models with Amazon SageMaker
• Create a Central Data Catalog with AWS Glue
Load into Downstream Services
• Amazon Redshift: Run complex analytic queries against petabytes of structured data.
• Amazon DynamoDB: A NoSQL database service that delivers consistent, single-digit millisecond latency at any scale.
• Amazon Aurora: A MySQL and PostgreSQL compatible relational database built for the cloud.
• Amazon Elasticsearch: Delivers Elasticsearch’s real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
• Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.
Today’s Focus
Central Storage (scalable, secure, cost-effective)
• S3
Data Ingestion
• AWS Snowball
• AWS Storage Gateway
• Amazon Kinesis Data Firehose
• AWS Direct Connect
• AWS Database Migration Service
• AWS DataSync
• AWS Transfer for SFTP
• Amazon S3 Transfer Acceleration
Catalog & Search
• AWS Glue
• Amazon DynamoDB
• Amazon Elasticsearch Service
Access & User Interfaces
• AWS AppSync
• Amazon API Gateway
• Amazon Cognito
Manage & Secure
• AWS KMS
• AWS CloudTrail
• AWS IAM
• Amazon CloudWatch
Analytics & Serving
• Amazon Athena
• Amazon EMR
• AWS Glue
• Amazon Redshift
• Amazon DynamoDB
• Amazon QuickSight
• Amazon Kinesis
• Amazon Elasticsearch Service
• Amazon Neptune
• Amazon RDS
Thank You!