© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building Your Data Lake
on AWS
Luke Anderson
Business Development, AWS
What to expect from the session
1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future
Rethink how to become a data-driven business
• Business outcomes
• Experimentation
• Agile and timely
Traditionally, Analytics looked like this
(Duplication & Sprawl)
[Diagram: siloed systems side by side — Hadoop/Spark, NoSQL, storage arrays, databases, and a data warehouse — with ETL jobs shuttling raw data into advanced analytics and structured data into SQL]
Defining the AWS data lake
A data lake is an architecture with a virtually limitless, centralized storage platform capable of categorizing, processing, analyzing, and consuming heterogeneous data sets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
AWS Data Lake Components
Any analytic workload, any scale, at the lowest possible cost
• Insights: QuickSight, SageMaker, Comprehend
• Analytics: Redshift + Spectrum (DW), EMR (big data processing), Athena (interactive), Elasticsearch Service and Kinesis Data Analytics (real-time)
• Data Lake: S3/Glacier (storage), Glue (ETL & data catalog)
• Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
Reasons to choose Amazon S3 for your data lake
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
Reducing Data Lake Costs
Optimize costs with data tiering
• Hot: use EMR/Hadoop with local HDFS for the hottest data sets
• Warm: store cooler data in Amazon S3 Standard or S3 Standard-Infrequent Access
• Cold: move cold data to Amazon Glacier to reduce costs
• Use S3 Analytics to optimize your tiering strategy
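Tiering like this can be automated with S3 lifecycle rules rather than moved by hand. A minimal sketch of such a configuration; the bucket name, prefix, and day thresholds are illustrative assumptions, not values from the talk:

```python
# Sketch: an S3 lifecycle configuration that tiers data automatically.
# Prefix and day thresholds are hypothetical; tune them with S3 Analytics.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "tier-raw-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                # Warm: move to S3 Standard-Infrequent Access after 30 days
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Cold: move to Glacier after 90 days
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expire objects after a year
            "Expiration": {"Days": 365},
        }
    ]
}

# Applied with boto3 (not executed here):
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_rules)
```

The rule set is plain data, so it can be reviewed and versioned alongside the rest of the data lake configuration.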
Process data in place…
Amazon Athena | Amazon Redshift Spectrum | Amazon EMR | AWS Glue
Amazon S3
Amazon EMR: Decouple compute & storage
• Highly distributed processing frameworks such as Hadoop/Spark
• Compress data sets
• Use columnar file formats
• Aggregate small files (S3DistCp with the --groupBy option)
Amazon Redshift Spectrum: Exabyte-scale query-in-place
• Structured data with joins
• Multiple on-demand clusters for concurrency at scale
• Columnar file formats
• Data partitioning
• Better query performance with predicate pushdown
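Partitioning pays off because the query engine can skip whole S3 prefixes that cannot match the predicate, before reading a single byte. A toy sketch of Hive-style partition pruning; the keys and the prune helper are hypothetical, not a Spectrum API:

```python
# Hypothetical Hive-style partitioned layout in S3
keys = [
    "sales/dt=2018-01-01/part-000.parquet",
    "sales/dt=2018-01-02/part-000.parquet",
    "sales/dt=2018-02-01/part-000.parquet",
]

def prune(keys, wanted_month):
    """Keep only partitions satisfying the predicate dt LIKE 'YYYY-MM%'.

    The engine never lists or scans objects under pruned prefixes.
    """
    return [k for k in keys if k.split("dt=")[1][:7] == wanted_month]

january = prune(keys, "2018-01")  # only January partitions are scanned
```

A query filtered on the partition column touches 2 of the 3 objects here; at exabyte scale the same idea skips the vast majority of the data.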
Amazon Athena: Query without ETL
• Serverless service
• Schema on read
• Compress data sets
• Columnar file formats
• Optimize file sizes
• Optimized querying (Presto backend)
• Query data in Glacier (coming)
Today: All of these tools…
retrieve a lot of data they don't need and then do the heavy lifting of filtering it themselves.
Today: You need to…
restore the entire object from Amazon Glacier to Amazon S3, and then use it.
Amazon S3 Select and Amazon Glacier Select
Select a subset of data from within an object based on a SQL expression.
Motivation Behind S3 Select
"GET all the data from S3 objects, and my application will filter the data that I need."
Redshift Spectrum example:
• Beta customer ran 50,000 queries
• Data fetched from S3: 6 PB
• Data actually used in Redshift: 650 TB
• Data needed from S3: roughly 10% of what was fetched
Amazon S3 Select
SELECT a filtered set of data from within an object using standard SQL statements
• First content-aware API within Amazon S3
• Unlike Amazon Athena and Redshift Spectrum, operates within the Amazon S3 system
• SQL statements operate on a per-object basis, not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI, and a Presto connector; others to follow
• Who will use it?
• Amazon Redshift Spectrum, Amazon Athena, Presto and other custom Query engines
• Everyone doing log mining
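A sketch of what such a request looks like through the boto3 SDK. The bucket, key, and column positions are hypothetical; the parameter shape follows boto3's select_object_content call, built here as plain data without executing it:

```python
# Sketch of an S3 Select request (parameters for boto3's
# select_object_content). Bucket, key, and columns are hypothetical.
select_request = {
    "Bucket": "my-data-lake",
    "Key": "logs/2018/03/events.csv.gz",
    "ExpressionType": "SQL",
    # Filtering happens inside S3; only matching rows leave the service
    "Expression": "SELECT s._1, s._4 FROM s3object s WHERE s._4 = 'x'",
    # S3 parses the object server-side, so it must know the layout
    "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"},
                           "CompressionType": "GZIP"},
    "OutputSerialization": {"JSON": {}},
}

# Executed as (requires credentials, so commented out here):
# response = s3_client.select_object_content(**select_request)
```

Because the call works per object and scales like a GET, a caller fans it out across many keys exactly as it would fan out plain GET requests.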
Amazon S3 Select
Input — format: delimited text (CSV, TSV), JSON…; compression: GZIP…
Output — format: delimited text (CSV, TSV), JSON…
• Clauses: SELECT, FROM, WHERE
• Data types: string; integer, float, decimal; timestamp; boolean
• Operators: conditional, math, logical, string (LIKE, ||)
• Functions: string, CAST, math, aggregate
Amazon S3 Select: Simple pattern matches
…get-object …object… | awk -F',' '{ if ($4 == "x") print $1 }'
…select-object-content …object… "SELECT o._1 FROM s3object o WHERE o._4 = 'x'" …
Amazon S3 Select: Serverless applications
[Diagram: an S3 event triggers AWS Lambda; the Lambda function runs S3 Select against the object and publishes results via Amazon SNS]
Amazon S3 Select: Serverless MapReduce

Before: 200 seconds and 11.2 cents

    # Download every object and filter client-side
    for key in src_keys:
        response = s3_client.get_object(Bucket=src_bucket, Key=key)
        contents = response['Body'].read().decode('utf-8')
        for line in contents.split('\n')[:-1]:
            line_count += 1
            try:
                data = line.split(',')
                srcIp = data[0][:8]
                …

After: 95 seconds and 2.8 cents

    # Push the projection down to S3 Select
    for key in src_keys:
        response = s3_client.select_object_content(
            Bucket=src_bucket, Key=key,
            ExpressionType='SQL',
            Expression="SELECT SUBSTRING(s._1, 1, 8), s._2 FROM s3object s",
            InputSerialization={'CSV': {}},
            OutputSerialization={'CSV': {}})
        # select_object_content returns an event stream, not a single body
        for event in response['Payload']:
            if 'Records' in event:
                line_count += 1
                …

2x faster at 1/4 of the cost (11.2¢ → 2.8¢)
Demo – S3 Select Timing
Amazon S3 Select with Presto
Works with your existing Hive Metastore
Automatically converts predicates into S3 Select requests
Amazon S3 Select: Accelerating big data
Before → After: 5x faster with 1/40 of the CPU
Using Amazon Glacier Select
How Amazon Glacier Select Works
Delivering Results Faster
Optimizing data lake performance
• Aggregate small files: EMR S3DistCp, Amazon Kinesis Data Firehose
• S3 Select: big data cheaper and faster, up to 400% faster
• Data formats: columnar formats
• EMRFS consistent view: Amazon S3 metadata tracked in Amazon DynamoDB
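Aggregating small files matters because every S3 object carries per-request overhead, so thousands of tiny objects read far more slowly than a few large ones. A toy sketch of the buffering idea behind Firehose and S3DistCp; the buffer_records helper and thresholds are illustrative, not any AWS API:

```python
def buffer_records(records, max_bytes=5 * 1024 * 1024):
    """Group many small records into fewer large batches, the way
    Kinesis Data Firehose buffers before writing an object to S3."""
    batches, current, size = [], [], 0
    for rec in records:
        # Flush the current batch when the next record would overflow it
        if size + len(rec) > max_bytes and current:
            batches.append(b"".join(current))
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(b"".join(current))
    return batches

small = [b"x" * 100] * 1000                    # 1,000 tiny records
batches = buffer_records(small, max_bytes=50_000)
# 2 well-sized objects instead of 1,000 tiny ones
```

The same 100 KB of data now lands in 2 objects instead of 1,000, so a downstream scan pays 2 request round-trips instead of 1,000.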
Amazon Kinesis — Real Time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams (new): capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
Data preparation accounts for ~80% of the work
• Building training sets
• Cleaning and organizing data
• Collecting data sets
• Mining data for patterns
• Refining algorithms
• Other
AWS Glue — Serverless data catalog & ETL service
Data Catalog: discover data and extract schema
ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Automatically discovers data and stores schema
• Data is searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
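Schema discovery can be pictured as sampling rows and guessing a type per column. A toy sketch of that idea; infer_schema is an illustrative helper, not the Glue crawler API, and the rows are hypothetical:

```python
def infer_schema(rows):
    """Toy version of what a crawler does: take a header plus sample
    rows and guess a column type for each field."""
    def guess(value):
        # Try progressively looser types; fall back to string
        for cast, type_name in ((int, "bigint"), (float, "double")):
            try:
                cast(value)
                return type_name
            except ValueError:
                pass
        return "string"

    header, *data = rows
    return {col: guess(data[0][i]) for i, col in enumerate(header)}

rows = [["ip", "bytes", "latency"],
        ["10.0.0.1", "512", "0.23"]]
schema = infer_schema(rows)
# {'ip': 'string', 'bytes': 'bigint', 'latency': 'double'}
```

A real crawler samples many rows and handles nested and partitioned data, but the stored result is the same kind of column-to-type mapping, which then makes the data searchable and ready for ETL.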
Amazon SageMaker (GA)
The quickest and easiest way to get ML models from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second
Planning for the Future
[Diagram: an example end-to-end architecture across Collect, Store, Analyze, and Visualize stages — transactional data (iOS/Android apps, web apps) into Amazon RDS and Amazon DynamoDB; stream data (Logstash, Apache Kafka, Amazon Kinesis, AWS Lambda) into stream storage; file and search data into Amazon S3, Amazon Glacier, Amazon ES, and Amazon ElastiCache; analysis via Amazon Elastic MapReduce (Impala, Pig), Amazon Redshift, Amazon ML, and Kinesis streaming; visualization via Amazon QuickSight, notebooks, IDEs, apps & APIs, and mobile apps — with hot/warm/cold tiers and SQL, NoSQL, cache, search, batch, interactive, streaming, and ML workloads labeled throughout]
Evolve as needed!
AWS Training Offer
Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data solutions
• Leverage tools to automate data analysis
Learning path: free AWS digital training (foundational knowledge) → Certified Cloud Practitioner → Associate-level certification → free AWS digital training (Big Data Technology Fundamentals) → Big Data on AWS (3-day classroom training) → AWS Certified Big Data – Specialty
Who should attend: enterprise solutions architects, Big Data solutions architects, data scientists, data analysts
Visit www.aws.training to find out more.
Thank You For Attending the
AWS Data Driven Decisions Webinar Series.
We hope you found it interesting! A kind reminder to complete the survey: let us know what you thought of today's event and how we can improve the event experience for you in the future.
aws-apac-marketing@amazon.com
twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws