© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stuart Wong, Platform Engineer ([email protected]) - Monsanto
Vishnu Kannan, Analytics Technical Platform Lead ([email protected]) – Monsanto
Lex Crosett, AWS Enterprises Solutions Architect ([email protected])
December 2, 2016
Case Study: How Monsanto Uses Amazon EFS with
Large Scale Geospatial Data Sets
Goals and expectations for this session
Overall goal: Introduce you to Amazon EFS (what it is, its
features, how it can help you)
The Monsanto team will describe their big data applications
using EFS as a storage platform
Session intended for all levels: We’ll cover both beginner
topics and more advanced concepts
We’ll do Q&A at the end
Agenda
1. Provide overview of EFS
2. Discuss EFS availability, scalability, durability, and security
properties
3. Share key EFS performance characteristics
4. Review Monsanto case study
Overview of Amazon EFS
AWS storage options (diagram)
Data transfer (batches and streams): AWS Direct Connect; AWS Snowball, Snowball Edge, Snowmobile; 3rd Party Connectors; Transfer Acceleration; Amazon Storage Gateway; Amazon Kinesis Firehose
File: Amazon EFS
Block: Amazon EBS (persistent); Amazon EC2 Instance Store (ephemeral)
Object: Amazon S3; Amazon Glacier
What is Amazon EFS?
A fully managed file system for Amazon EC2 instances
Exposes a file system interface that works with standard
operating system APIs
Provides file system access semantics (consistency, locking)
Sharable across thousands of instances
Designed to grow elastically to petabyte scale
Built for performance across a wide variety of workloads
Highly available and durable
Operating file storage on-prem today is a pain
IT administrator
- Estimate demand
- Procure hardware
- Set aside physical space
- Set up and maintain hardware (and network)
- Manage access and security
Application owner or developer
- Provide demand forecasts/business case
- Add lead times and extra coordination to your schedule
- Limit your flexibility and agility
Business owner
- Make up-front capital investments, over-buy, stay on a constant upgrade/refresh cycle
- Sacrifice business agility
- Distract your people from your business’s mission
Building your own on the cloud is too much
work and is expensive
Replicate EBS volumes (1 per EC2 instance)
- Substantial management overhead (sync data, provision and manage volumes)
- Costly (one volume per instance)
Use a shared file layer
- Complex to set up and maintain
- Scale challenges
- Costly (compute + storage)
We focused on changing the game
1. Simple
2. Elastic
3. Scalable
Highly durable
Highly available
Amazon EFS is simple
Fully managed
- No hardware, network, file layer
- Create a scalable file system in seconds!
Seamless integration with existing tools and apps
- NFS v4.1—widespread, open
- Standard file system access semantics
- Works with standard OS file system APIs
Simple pricing = simple forecasting
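Because EFS speaks NFS v4.1, mounting it uses the standard NFS client. The sketch below only assembles the mount command so it can be inspected before running. The file system ID, region, and mount point are hypothetical placeholders, and the DNS name pattern and mount options follow the form AWS documents for EFS; check them against current AWS guidance before use.

```python
from typing import List


def efs_mount_command(file_system_id: str, region: str, mount_point: str) -> List[str]:
    """Build (but do not run) an NFS v4.1 mount command for an EFS file system."""
    # DNS name pattern for an EFS file system (assumed, per AWS documentation).
    dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"
    # Commonly recommended NFS client options for EFS.
    options = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
    return ["sudo", "mount", "-t", "nfs4", "-o", options, f"{dns_name}:/", mount_point]


# Hypothetical file system ID and mount point, for illustration only.
cmd = efs_mount_command("fs-12345678", "us-west-2", "/mnt/efs")
print(" ".join(cmd))
```

The command is returned as a list so it can be passed to `subprocess.run` without shell quoting concerns.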
Amazon EFS is elastic
File systems grow and shrink automatically
as you add and remove files
No need to provision storage capacity or
performance
You pay only for the storage space you use,
with no minimum fee
EFS price: $0.30/GB-month
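As a rough illustration of the pricing model above, the monthly bill can be estimated from stored data alone. This sketch assumes the $0.30/GB-month price quoted on the slide and nothing else; the function name is illustrative, and current pricing may differ.

```python
PRICE_PER_GB_MONTH_USD = 0.30  # price quoted on the slide; verify against current pricing


def monthly_cost_usd(stored_gb: float) -> float:
    """Estimate the monthly EFS charge for a given average amount of stored data."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * PRICE_PER_GB_MONTH_USD


for size_gb in (100, 1000, 10000):
    print(f"{size_gb} GB stored -> ${monthly_cost_usd(size_gb):.2f} per month")
```

There is no provisioned-capacity term in the calculation, which is the point of the slide: cost tracks what is stored.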
Amazon EFS is scalable
File systems can grow to petabyte scale
Throughput and IOPS scale automatically
as file systems grow
Consistent low latencies regardless of file
system size
Support for thousands of concurrent NFS
connections
Highly durable and highly available
Designed to sustain AZ offline conditions
Superior to traditional NAS availability
models
Appropriate for production/tier 0
applications
Several security mechanisms
Control network traffic to and from file systems (mount
targets) by using VPC security groups and network ACLs
Control file and directory access by using POSIX
permissions
Control administrative access (API access) to file
systems by using AWS Identity and Access Management
(IAM)
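The POSIX permission layer works the same way on an EFS mount as on any local file system. A minimal sketch, assuming a POSIX host and using a temporary directory as a stand-in for a directory on an EFS mount (the helper name is illustrative):

```python
import os
import stat
import tempfile


def restrict_to_owner_and_group(path: str) -> str:
    """Set rwx for owner, r-x for group, nothing for others; return the mode string."""
    os.chmod(path, 0o750)
    return stat.filemode(os.stat(path).st_mode)


# A temporary directory stands in for a directory under the EFS mount point.
with tempfile.TemporaryDirectory() as shared_dir:
    mode = restrict_to_owner_and_group(shared_dir)
    print(mode)
```

Security groups and IAM are configured outside the file system, so they are not shown here.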
In which regions can I use EFS?
US West (Oregon)
US East (N. Virginia)
EU (Ireland)
Data is stored in multiple AZs for high availability
and durability
Every file system object (directory, file, and link) is redundantly stored across multiple AZs in a region
(Diagram: Amazon EFS spanning Availability Zones 1, 2, and 3 within a region)
EFS provides throughput that scales as a file system
grows
As a file system gets larger, it
needs access to more
throughput
Many file workloads are spiky,
with peak throughput well above
average levels
Amazon EFS scalable bursting model is designed to
make performance available when you need it
Bursting model examples
File system size | Continuous read/write throughput | Burst read/write throughput
1 TB | up to 50 MB/s | 100 MB/s for up to 12 hours each day*
10 TB | up to 500 MB/s | 1 GB/s for up to 12 hours each day*
100 GB | up to 5 MB/s | 100 MB/s for up to 72 minutes each day*
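The three examples above follow one pattern, which can be sketched as arithmetic. The constants below are inferred from the table rather than taken from an AWS specification: a baseline of 50 MB/s per TB, a burst rate of 100 MB/s per TB with a 100 MB/s floor, and daily burst time equal to the share of the day that a full day of baseline throughput would sustain at the burst rate. The sketch also assumes 1 TB = 1000 GB and that the file system is otherwise idle while not bursting.

```python
BASELINE_MB_S_PER_TB = 50.0   # inferred from the slide's examples
BURST_MB_S_PER_TB = 100.0     # inferred from the slide's examples
MIN_BURST_MB_S = 100.0        # small file systems can still burst to 100 MB/s
MINUTES_PER_DAY = 24 * 60


def bursting_profile(size_gb: float):
    """Return (baseline MB/s, burst MB/s, burst minutes per day) for a file system size."""
    size_tb = size_gb / 1000.0
    baseline = size_tb * BASELINE_MB_S_PER_TB
    burst = max(MIN_BURST_MB_S, size_tb * BURST_MB_S_PER_TB)
    # A day's worth of baseline throughput, spent at the burst rate.
    burst_minutes = MINUTES_PER_DAY * baseline / burst
    return baseline, burst, burst_minutes


for size_gb in (100, 1000, 10000):
    baseline, burst, minutes = bursting_profile(size_gb)
    print(f"{size_gb} GB: {baseline:.0f} MB/s baseline, "
          f"{burst:.0f} MB/s burst for {minutes:.0f} min/day")
```

Under these assumptions the function reproduces all three rows of the table, including the 72-minute figure for the 100 GB case.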
Amazon EFS is designed for a wide spectrum of use cases
High throughput and parallel I/O
Low latency and serial I/O
Genomics
Big data analytics
Scale-out jobs
Home directories
Content management
Web serving
Metadata-intensive jobs
How Monsanto Uses Amazon
EFS with Large Scale
Geospatial Data Sets
About us
Vishnu Alavur Kannan, Analytics Technical Platforms Lead
https://www.linkedin.com/in/vishnukannan
• 15+ years in IT, software engineer @heart
• Led engineering teams throughout my career
• ‘A’ players make all the difference
• 50:1, 100:1, rarely in any other profession
• @Monsanto for two reasons:
• I believe in our commitment to sustainable agriculture
• I am able to do top-flight Engineering R&D
Stuart Wong, Platform Engineer (@cgswong)
https://www.linkedin.com/in/cgswong
• 15+ years in IT
• SysAdmin, DBA, team lead, Infrastructure Engineer
• Love all things technology and open source
• @Monsanto for two reasons:
• I believe in the mission
• Able to work with and learn from people much smarter than me
Monsanto
A sustainable agriculture company
• Bringing a broad range of solutions to help nourish our growing world
• Collaborating to help tackle some of the world’s biggest challenges
• >20,000 employees in 66 countries
• >50% of employees based outside of the United States
• Named one of the 25 World’s Best Multinational Workplaces by the Great Place to Work Institute
What to Expect from the Session
• Some background
• Geospatial on EFS
• Analytics@scale on EFS
• Final thoughts
• Q&A
Some Background
Embarked on a “cloud first” strategy in 2015. Specifically, refactor
or build new applications/services in the cloud. We had:
• Legacy on-premises environment
• Scalability constraints
• Growth constraints
• Stability and performance challenges
• Proprietary applications
Geospatial Make Over…
To move data and analytics to the cloud we needed:
• Open source, standards based
• Scalable and performant
• Fault tolerant
• Secured, but easy to use
• Cost effective
Geospatial Data Assets in R&D
Geospatial Catalog
Initially…
GeoServer Clustering
Problems
• No built-in clustering
• Manual setup process
• No shared state
• In-memory caching
Solution
• Clustering Extension
  • Handles change detection and broadcasting
  • Clears relevant caches
  • Handles HTTP session sharing
• RDS for data directory (vector data)
• EFS for configuration and raster data
Options we compared
Criterion | Database | BYO | Amazon EFS
Setup | 3 | 2 | 4
Management | 3 | 2 | 5
Scalability | 3 | 4 | 5
Performance | 3 | 5 | 4
Cost | 3 | 3 | 4
Total | 15 | 16 | 22
Score ratings: 1 = Poor, 2 = Bad, 3 = Fair, 4 = Good, 5 = Excellent
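The totals in the comparison are plain unweighted sums of the per-criterion scores. A minimal sketch that reproduces them from the table values (equal weighting is what the slide's totals imply; the variable names are illustrative):

```python
# Scores copied from the comparison table (1 = Poor ... 5 = Excellent).
scores = {
    "Database":   {"Setup": 3, "Management": 3, "Scalability": 3, "Performance": 3, "Cost": 3},
    "BYO":        {"Setup": 2, "Management": 2, "Scalability": 4, "Performance": 5, "Cost": 3},
    "Amazon EFS": {"Setup": 4, "Management": 5, "Scalability": 5, "Performance": 4, "Cost": 4},
}

# Unweighted total per option.
totals = {option: sum(criteria.values()) for option, criteria in scores.items()}
best = max(totals, key=totals.get)

for option, total in totals.items():
    print(f"{option}: {total}")
print(f"Highest total: {best}")
```

A team that cares more about raw performance than management effort could apply weights before summing; BYO scores highest on performance alone.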
Current…
EFS Performance - Bytes and Credits
EFS Performance – Limits and Total IO
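The charts for these two slides did not survive extraction. Assuming they were drawn from Amazon CloudWatch, similar data can be requested from the `AWS/EFS` namespace, which publishes metrics such as `BurstCreditBalance`, `PermittedThroughput`, and `TotalIOBytes`. The sketch below builds the request parameters for CloudWatch `get_metric_statistics`. The file system ID is a hypothetical placeholder, and the `fetch_datapoints` helper needs the third-party `boto3` package and AWS credentials, so it is defined but not called.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def efs_metric_request(file_system_id: str, metric_name: str, hours: int = 24,
                       period_seconds: int = 300,
                       now: Optional[datetime] = None) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics on an EFS metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EFS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


def fetch_datapoints(request: dict) -> list:
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    return boto3.client("cloudwatch").get_metric_statistics(**request)["Datapoints"]


# Hypothetical file system ID, for illustration only.
request = efs_metric_request("fs-12345678", "BurstCreditBalance")
print(request["Namespace"], request["MetricName"])
```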
Analytics@scale on EFS
• Analytics as a Service
Collaborative Exploratory & Discovery Analytics Platform
• Production Eco-system
Use case: Environmental Classification
Analytics as a Service
Collaborative Exploratory & Discovery Analytics Platform
Exploratory - Nonprime Discovery - Prime
Development Environments @SCALE
• Big-data DevOps
• Model Deployments @scale
• Big-data workloads
• Cloud Best practices
• Monitoring, Alerting…
• Security/ISO
Analytics @SCALE BLUE - GREEN
• Co-engineering
• Thinking scale ahead
• Model refactoring
• Infrastructure as code
• Distributed computing
• API & Streams
CPLEXDOC
• User sessions and backups
• EMR configurations and workloads
• Docker/ECS configurations and workloads
• R & Python Package stores
Collaborative Exploratory & Discovery Analytics Platform
High level Architecture
via Ansible/CFN - AMI, Docker - based Instrumentation
Exploratory Analytics Platform – Non-prime/Sandbox
via Ansible/CFN - AMI, Docker - based Instrumentation
Discovery Analytics Platform – Prime
via Ansible/CFN - AMI, Docker - based Instrumentation
Analytics as a Service: Exploratory and Discovery Analytics - Development Environments
Data Scientists, Developers and Novice Users
From Discovery to Production - Culture, approach & adoption
Know
Your
Users
For Community
By Community
Tailor by
Needs
Balance Freedom
with Governance
Drive
User
Adoption
Environments
iteratively served to
everyone @monsanto
Enable analytical capabilities @scale for the enterprise integrated with Product Platforms
As of today, # of unique data scientists across groups utilizing our discovery analytics environments
Model maturity Global Scalability
Core teams : Train the trainee to share knowledge and best practices utilizing the environment
Production Eco-system: Environmental Classification
Monsanto’s competitive advantage
By managing interactions between different zones, we can enable:
• Prescriptive recommendations
• Predictive Genotype x Environment interaction
• Accelerate Research Pipeline
• Advisory products
• Predict ranking of hybrid(s) and inbred(s) within and across
environments
• Link Research to Manufacturing to Customer fields
(Figure: hybrid rank by environment. Three panels show the ranking 1, 2, 3; three others show the ranking 3, 1, 2.)
EC classes are regions in feature space (topography, climate)
Monsanto Advisors: right treatment in the right environment
Environmental Classification Engine - Objectives
Identify discrete zones within a field based on sub-field environmental and macro-level weather factors and their relationship to phenotype performance (e.g. yield)
Treatments developed and tested on mapped research fields
The best treatments applied to each sub-field environment in production fields
Data analytics find the
best treatment for
each sub field
environment
Environmental Classification Engine @scale
(Diagram: Data Provisioning APIs; Data Transformation; QA/QC Rules; Scala, Python, Scikit; API)
From Gridding our fields to Gridding the Entire United States
Amazon EMR
Environmental Classification Engine
EFS & EMR Performance
DATA INGESTION AND TRANSFORMATION VIA APIs AND STREAMS
Streaming
Business Intelligence
Production Eco-system – RUN ANALYTICS@SCALE IN THE CLOUD
Collaborative Data Science – EXPLORATORY & DISCOVERY ANALYTICS PLATFORM
DATA DRIVEN PRODUCTS
KAFKA Streams, Data Warehouse*, Big-data
Model outputs via APIs & Streams
In-house/Third Party: Platforms AWS, GCP, Cloudera, DataStax, IBM, Azure, Domino labs…
Prescriptive, Predictive, Cognitive, Historical
Models - Deep Learning, Computational Pipelines, Classification & Simulation Engines
Turn Data into Actionable Insights
Recommendations
• Keep it simple, follow AWS guidance
• Choose the appropriate mode
• Ensure tooling can read file system size
• Plan for redeployments
• Check AWS limits
• Plan backup/recovery early
• Know performance model
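One reading of "Ensure tooling can read file system size" (an assumption, since the slide does not elaborate) is that monitoring and capacity tools should cope with the very large total size an EFS mount reports to the NFS client, on the order of 8 EiB, rather than a provisioned capacity. A minimal sketch that reads the reported size and flags such effectively unbounded values; the 1 EiB threshold and the helper name are illustrative choices:

```python
import shutil
import tempfile

EIB = 2 ** 60  # one exbibyte, used here as an arbitrary "effectively unbounded" threshold


def describe_capacity(path: str) -> dict:
    """Report total and used bytes for the file system containing path."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        # On an elastic file system, used bytes matter; total is not a real limit.
        "looks_unbounded": usage.total >= EIB,
    }


# The system temp directory stands in for an EFS mount point such as /mnt/efs.
info = describe_capacity(tempfile.gettempdir())
print(info)
```

Alerting on percentage full makes little sense when `looks_unbounded` is true; alerting on used bytes or on cost is a closer fit.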
Takeaways
• Simple setup
• No Management!
• Usage-based performance
• Usage-based cost
• Almost unlimited scale
• Remember it is just a shared filesystem
Thank you!
Questions?
Remember to complete
your evaluations!