© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building Your Data Lake
on AWS
Luke Anderson
Business Development, AWS
What to expect from the session
1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future
Rethink how to become a data-driven business
• Business outcomes
• Experimentation
• Agile and timely
Traditionally, Analytics looked like this
(Duplication & Sprawl)
[Diagram: siloed systems side by side — Hadoop/Spark, NoSQL, storage arrays, databases, and a data warehouse — with ETL jobs shuttling raw data into advanced analytics and structured data into SQL]
Defining the AWS data lake
A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorizing, processing, analyzing, and consuming heterogeneous data sets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
AWS Data Lake Components
Any analytic workload, any scale, at the lowest possible cost
• Insights: QuickSight, SageMaker, Comprehend
• Analytics: Redshift + Spectrum (DW), EMR (big data processing), Athena (interactive), Elasticsearch Service and Kinesis Data Analytics (real-time)
• Data Lake: S3/Glacier (storage), Glue (ETL & data catalog)
• Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
Reasons to choose Amazon S3 for your data lake
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
Reducing Data Lake Costs
Optimize costs with data tiering
• Hot: use EMR/Hadoop with local HDFS for the hottest data sets
• Warm: store cooler data in Amazon S3 Standard or S3 Standard-Infrequent Access
• Cold: move cold data to Amazon Glacier to reduce costs
• Use S3 Analytics to optimize your tiering strategy
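Tiering like this can be automated with S3 lifecycle rules rather than moved by hand. A minimal sketch of such a configuration; the bucket name, prefix, and day thresholds are illustrative assumptions, not values from the talk:

```python
# Sketch: an S3 lifecycle configuration that tiers data automatically.
# Prefix and day thresholds are hypothetical; tune them with S3 Analytics.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "tier-raw-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                # Warm: move to S3 Standard-Infrequent Access after 30 days
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Cold: move to Glacier after 90 days
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expire objects after a year
            "Expiration": {"Days": 365},
        }
    ]
}

# Applied with boto3 (not executed here):
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_rules)
```

The rule set is plain data, so it can be reviewed and versioned alongside the rest of the data lake configuration.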
Process data in place…
Amazon Athena | Amazon Redshift Spectrum | Amazon EMR | AWS Glue
Amazon S3
Amazon EMR: Decouple compute & storage
• Highly distributed processing frameworks such as Hadoop/Spark
• Compress data sets
• Use columnar file formats
• Aggregate small files (S3DistCp with the --groupBy option)
Amazon Redshift Spectrum: Exabyte-scale query-in-place
• Structured data with joins
• Multiple on-demand clusters for concurrency at scale
• Columnar file formats
• Data partitioning
• Better query performance with predicate pushdown
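Partitioning pays off because the query engine can skip whole S3 prefixes that cannot match the predicate, before reading a single byte. A toy sketch of Hive-style partition pruning; the keys and the prune helper are hypothetical, not a Spectrum API:

```python
# Hypothetical Hive-style partitioned layout in S3
keys = [
    "sales/dt=2018-01-01/part-000.parquet",
    "sales/dt=2018-01-02/part-000.parquet",
    "sales/dt=2018-02-01/part-000.parquet",
]

def prune(keys, wanted_month):
    """Keep only partitions satisfying the predicate dt LIKE 'YYYY-MM%'.

    The engine never lists or scans objects under pruned prefixes.
    """
    return [k for k in keys if k.split("dt=")[1][:7] == wanted_month]

january = prune(keys, "2018-01")  # only January partitions are scanned
```

A query filtered on the partition column touches 2 of the 3 objects here; at exabyte scale the same idea skips the vast majority of the data.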
Amazon Athena: Query without ETL
• Serverless service
• Schema on read
• Compress data sets
• Columnar file formats
• Optimize file sizes
• Optimized querying (Presto backend)
• Query data in Glacier (coming)
Today: All of these tools…
retrieve a lot of data they don't need and then do the heavy lifting of filtering it themselves.
Today: You need to…
restore the entire object from Amazon Glacier to Amazon S3, and then use it.
Amazon S3 Select and Amazon Glacier Select
Select a subset of data from within an object based on a SQL expression.
Motivation Behind S3 Select
"GET all the data from S3 objects, and my application will filter the data that I need."
Redshift Spectrum example:
• Beta customer ran 50,000 queries
• Data fetched from S3: 6 PB
• Data actually used in Redshift: 650 TB
• Data needed from S3: roughly 10% of what was fetched
Amazon S3 Select
SELECT a filtered set of data from within an object using standard SQL statements
• First content-aware API within Amazon S3
• Unlike Amazon Athena and Redshift Spectrum, operates within the Amazon S3 system
• SQL statements operate on a per-object basis, not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI, and a Presto connector; others to follow
• Who will use it?
• Amazon Redshift Spectrum, Amazon Athena, Presto and other custom Query engines
• Everyone doing log mining
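A sketch of what such a request looks like through the boto3 SDK. The bucket, key, and column positions are hypothetical; the parameter shape follows boto3's select_object_content call, built here as plain data without executing it:

```python
# Sketch of an S3 Select request (parameters for boto3's
# select_object_content). Bucket, key, and columns are hypothetical.
select_request = {
    "Bucket": "my-data-lake",
    "Key": "logs/2018/03/events.csv.gz",
    "ExpressionType": "SQL",
    # Filtering happens inside S3; only matching rows leave the service
    "Expression": "SELECT s._1, s._4 FROM s3object s WHERE s._4 = 'x'",
    # S3 parses the object server-side, so it must know the layout
    "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"},
                           "CompressionType": "GZIP"},
    "OutputSerialization": {"JSON": {}},
}

# Executed as (requires credentials, so commented out here):
# response = s3_client.select_object_content(**select_request)
```

Because the call works per object and scales like a GET, a caller fans it out across many keys exactly as it would fan out plain GET requests.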
Amazon S3 Select
Input — format: delimited text (CSV, TSV), JSON…; compression: GZIP…
Output — format: delimited text (CSV, TSV), JSON…
• Clauses: SELECT, FROM, WHERE
• Data types: string; integer, float, decimal; timestamp; boolean
• Operators: conditional, math, logical, string (LIKE, ||)
• Functions: string, CAST, math, aggregate
Amazon S3 Select: Simple pattern matches
…get-object …object… | awk -F',' '{ if ($4 == "x") print $1 }'
…select-object-content …object… "SELECT o._1 FROM s3object o WHERE o._4 = 'x'" …
Amazon S3 Select: Serverless applications
[Diagram: an S3 event triggers AWS Lambda; the Lambda function runs S3 Select against the object and publishes results via Amazon SNS]
Amazon S3 Select: Serverless MapReduce

Before: 200 seconds and 11.2 cents

    # Download every object and filter client-side
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                …

After: 95 seconds and 2.8 cents

    # Push the projection down to S3 Select
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTRING(s._1, 1, 8), s._2 FROM s3object s",
            InputSerialization={'CSV': {}},
            OutputSerialization={'CSV': {}})
        # select_object_content returns an event stream, not a single body
        for event in response['Payload']:
            if 'Records' in event:
                line_count += 1
                …

2x faster at 1/4 of the cost (11.2¢ → 2.8¢)
Demo – S3 Select Timing
Amazon S3 Select with Presto
Works with your existing Hive Metastore
Automatically converts predicates into S3 Select requests
Amazon S3 Select: Accelerating big data
Before → After: 5x faster with 1/40 of the CPU
Using Amazon Glacier Select
How Amazon Glacier Select Works
Delivering Results Faster
Optimizing data lake performance
• Aggregate small files: EMR S3DistCp, Amazon Kinesis Data Firehose
• S3 Select: big data cheaper and faster, up to 400% faster
• Data formats: columnar formats
• EMRFS consistent view: Amazon S3 metadata tracked in Amazon DynamoDB
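Aggregating small files matters because every S3 object carries per-request overhead, so thousands of tiny objects read far more slowly than a few large ones. A toy sketch of the buffering idea behind Firehose and S3DistCp; the buffer_records helper and thresholds are illustrative, not any AWS API:

```python
def buffer_records(records, max_bytes=5 * 1024 * 1024):
    """Group many small records into fewer large batches, the way
    Kinesis Data Firehose buffers before writing an object to S3."""
    batches, current, size = [], [], 0
    for rec in records:
        # Flush the current batch when the next record would overflow it
        if size + len(rec) > max_bytes and current:
            batches.append(b"".join(current))
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(b"".join(current))
    return batches

small = [b"x" * 100] * 1000                    # 1,000 tiny records
batches = buffer_records(small, max_bytes=50_000)
# 2 well-sized objects instead of 1,000 tiny ones
```

The same 100 KB of data now lands in 2 objects instead of 1,000, so a downstream scan pays 2 request round-trips instead of 1,000.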
Amazon Kinesis — Real Time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams (new): capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
Data preparation accounts for ~80% of the work
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other
AWS Glue — Serverless data catalog & ETL service
Data Catalog: discover data and extract schema
ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Automatically discovers data and stores schema
• Data is searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
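Schema discovery can be pictured as sampling rows and guessing a type per column. A toy sketch of that idea; infer_schema is an illustrative helper, not the Glue crawler API, and the rows are hypothetical:

```python
def infer_schema(rows):
    """Toy version of what a crawler does: take a header plus sample
    rows and guess a column type for each field."""
    def guess(value):
        # Try progressively looser types; fall back to string
        for cast, type_name in ((int, "bigint"), (float, "double")):
            try:
                cast(value)
                return type_name
            except ValueError:
                pass
        return "string"

    header, *data = rows
    return {col: guess(data[0][i]) for i, col in enumerate(header)}

rows = [["ip", "bytes", "latency"],
        ["10.0.0.1", "512", "0.23"]]
schema = infer_schema(rows)
# {'ip': 'string', 'bytes': 'bigint', 'latency': 'double'}
```

A real crawler samples many rows and handles nested and partitioned data, but the stored result is the same kind of column-to-type mapping, which then makes the data searchable and ready for ETL.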
Amazon SageMaker (GA)
The quickest and easiest way to get ML models from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second
Planning for the Future
[Diagram: an example end-to-end architecture across Collect, Store, Analyze, and Visualize stages — transactional data (iOS/Android apps, web apps) into Amazon RDS and Amazon DynamoDB; stream data (Logstash, Apache Kafka, Amazon Kinesis, AWS Lambda) into stream storage; file and search data into Amazon S3, Amazon Glacier, Amazon ES, and Amazon ElastiCache; analysis via Amazon Elastic MapReduce (Impala, Pig), Amazon Redshift, Amazon ML, and Kinesis streaming; visualization via Amazon QuickSight, notebooks, IDEs, apps & APIs, and mobile apps — with hot/warm/cold tiers and SQL, NoSQL, cache, search, batch, interactive, streaming, and ML workloads labeled throughout]
Evolve as needed!
AWS Training Offer
Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data solutions
• Leverage tools to automate data analysis
Learning path: free AWS digital training (foundational knowledge) → Certified Cloud Practitioner → Associate-level certification → free AWS digital training (Big Data Technology Fundamentals) → Big Data on AWS (3-day classroom training) → AWS Certified Big Data – Specialty
Who should attend: enterprise solutions architects, Big Data solutions architects, data scientists, data analysts
Visit www.aws.training to find out more.
Thank You For Attending the
AWS Data Driven Decisions Webinar Series.
We hope you found it interesting! A kind reminder to complete the survey: let us know what you thought of today's event and how we can improve the event experience for you in the future.
aws-apac-marketing@amazon.com
twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws