
AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large-Scale Geospatial Data Sets (STG208)


Page 1: AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large-Scale Geospatial Data Sets (STG208)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Stuart Wong, Platform Engineer ([email protected]) - Monsanto

Vishnu Kannan, Analytics Technical Platform Lead ([email protected]) – Monsanto

Lex Crosett, AWS Enterprises Solutions Architect ([email protected])

December 2, 2016

Case Study: How Monsanto Uses Amazon EFS with Large-Scale Geospatial Data Sets

Page 2

Goals and expectations for this session

Overall goal: Introduce you to Amazon EFS (what it is, its features, how it can help you)

The Monsanto team will describe their big data applications using EFS as a storage platform

Session intended for all levels: We’ll cover both beginner topics and more advanced concepts

We’ll do Q&A at the end

Page 3

Agenda

1. Provide overview of EFS

2. Discuss EFS availability, scalability, durability, and security properties

3. Share key EFS performance characteristics

4. Review Monsanto case study

Page 4

Overview of Amazon EFS

Page 5

File: Amazon EFS

Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)

Object: Amazon S3, Amazon Glacier

Batches and Streams: AWS Direct Connect; AWS Snowball, Snowball Edge, Snowmobile; 3rd Party Connectors; Transfer Acceleration; Amazon Storage Gateway; Amazon Kinesis Firehose

Page 7

What is Amazon EFS?

A fully managed file system for Amazon EC2 instances

Exposes a file system interface that works with standard operating system APIs

Provides file system access semantics (consistency, locking)

Sharable across thousands of instances

Designed to grow elastically to petabyte scale

Built for performance across a wide variety of workloads

Highly available and durable

Page 8

Operating file storage on-prem today is a pain

IT administrator

- Estimate demand

- Procure hardware

- Set aside physical space

- Set up and maintain hardware (and network)

- Manage access and security

Application owner or developer

- Provide demand forecasts/business case

- Add lead times and extra coordination to your schedule

- Limit your flexibility and agility

Business owner

- Make up-front capital investments, over-buy, stay on a constant upgrade/refresh cycle

- Sacrifice business agility

- Distract your people from your business’s mission

Page 9

Building your own on the cloud is too much work and is expensive

Replicate EBS volumes (1 per EC2 instance)

- Substantial management overhead (sync data, provision and manage volumes)

- Costly (one volume per instance)

Use a shared file layer

- Complex to set up and maintain

- Scale challenges

- Costly (compute + storage)

Page 10

We focused on changing the game

1. Simple, 2. Elastic, 3. Scalable

Highly durable

Highly available

Page 11

Amazon EFS is simple

Fully managed

- No hardware, network, file layer

- Create a scalable file system in seconds!

Seamless integration with existing tools and apps

- NFS v4.1—widespread, open

- Standard file system access semantics

- Works with standard OS file system APIs

Simple pricing = simple forecasting

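Because EFS is accessed over NFS v4.1, attaching it to an instance is an ordinary NFS mount. The sketch below only builds the mount command; it does not run it. The file system ID `fs-12345678`, the region, the `<file-system-id>.efs.<region>.amazonaws.com` host name pattern, and the NFS client options are assumptions for illustration and do not come from the slides.

```python
def efs_mount_command(fs_id, region, mount_point):
    """Build (but do not run) an NFS v4.1 mount command for an EFS file system."""
    # Assumed DNS naming pattern for the file system endpoint.
    host = f"{fs_id}.efs.{region}.amazonaws.com"
    # Assumed NFS client options; adjust them for your workload.
    options = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
    return ["sudo", "mount", "-t", "nfs4", "-o", options, f"{host}:/", mount_point]


if __name__ == "__main__":
    # fs-12345678 is a hypothetical file system ID.
    print(" ".join(efs_mount_command("fs-12345678", "us-west-2", "/mnt/efs")))
```

Returning the command as a list keeps it safe to hand to `subprocess.run` without shell quoting, should you choose to execute it.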

Page 12

Amazon EFS is elastic

File systems grow and shrink automatically as you add and remove files

No need to provision storage capacity or performance

You pay only for the storage space you use, with no minimum fee

EFS price: $0.30/GB-month
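With a single per-GB rate and no minimum fee, forecasting is a one-line calculation. A minimal sketch, assuming the $0.30/GB-month price quoted on the slide and a hypothetical average stored size; the function name is illustrative.

```python
EFS_PRICE_PER_GB_MONTH = 0.30  # USD per GB-month, as quoted on the slide


def monthly_storage_cost(stored_gb, price_per_gb_month=EFS_PRICE_PER_GB_MONTH):
    """Estimate one month's storage charge for an average stored size in GB."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    return stored_gb * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical file system sizes, for illustration only.
    for gb in (100, 1_000, 10_000):
        print(f"{gb:>6} GB -> ${monthly_storage_cost(gb):,.2f} per month")
```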

Page 13

Amazon EFS is scalable

File systems can grow to petabyte scale

Throughput and IOPS scale automatically as file systems grow

Consistent low latencies regardless of file system size

Support for thousands of concurrent NFS connections

Page 14

Highly durable and highly available

Designed to sustain AZ offline conditions

Superior to traditional NAS availability models

Appropriate for production/tier 0 applications

Page 15

Several security mechanisms

Control network traffic to and from file systems (mount targets) by using VPC security groups and network ACLs

Control file and directory access by using POSIX permissions

Control administrative access (API access) to file systems by using AWS Identity and Access Management (IAM)

Page 16

In which regions can I use EFS?

US West (Oregon)

US East (N. Virginia)

EU (Ireland)

Page 17

Data is stored in multiple AZs for high availability and durability

Every file system object (directory, file, and link) is redundantly stored across multiple AZs in a region

[Diagram: a region containing Availability Zones 1, 2, and 3, with Amazon EFS spanning them]

Page 18

EFS provides throughput that scales as a file system grows

As a file system gets larger, it needs access to more throughput

Many file workloads are spiky, with peak throughput well above average levels

Amazon EFS scalable bursting model is designed to make performance available when you need it

Page 19

Bursting model examples

File system size   Continuous throughput   Burst throughput
1 TB               up to 50 MB/s           100 MB/s for up to 12 hours each day*
10 TB              up to 500 MB/s          1 GB/s for up to 12 hours each day*
100 GB             up to 5 MB/s            100 MB/s for up to 72 minutes each day*

A file system can either drive its continuous rate all day or burst for the time shown.
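The three examples above follow one pattern: the continuous rate is 50 MB/s per TB stored, and the daily burst time is however long a day's worth of continuous throughput lasts when spent at the burst rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB) and assuming the burst rate is 100 MB/s up to 1 TB and 100 MB/s per TB above that; both rules are inferred from the slide's examples, and the function name is illustrative.

```python
BASELINE_MBPS_PER_TB = 50.0  # continuous MB/s per TB stored (from the examples)
MIN_BURST_MBPS = 100.0       # assumed burst rate for file systems up to 1 TB
BURST_MBPS_PER_TB = 100.0    # assumed burst rate per TB above 1 TB
MINUTES_PER_DAY = 24 * 60


def burst_profile(size_gb):
    """Return (continuous MB/s, burst MB/s, burst minutes per day) for a size in GB."""
    size_tb = size_gb / 1000.0  # decimal units, matching the slide's round numbers
    baseline = size_tb * BASELINE_MBPS_PER_TB
    burst = max(MIN_BURST_MBPS, size_tb * BURST_MBPS_PER_TB)
    # A day of throughput earned at the baseline rate, spent at the burst rate.
    burst_minutes = MINUTES_PER_DAY * baseline / burst
    return baseline, burst, burst_minutes


if __name__ == "__main__":
    for size_gb in (100, 1_000, 10_000):
        baseline, burst, minutes = burst_profile(size_gb)
        print(f"{size_gb:>6} GB: {baseline:g} MB/s continuous, "
              f"{burst:g} MB/s burst for {minutes:g} min/day")
```

For 100 GB this gives 5 MB/s continuous and 72 minutes of bursting at 100 MB/s per day; for 1 TB and 10 TB it gives 720 minutes (12 hours), matching the slide.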

Page 20

Amazon EFS is designed for a wide spectrum of use cases

Spectrum: from high throughput and parallel I/O to low latency and serial I/O

Genomics

Big data analytics

Scale-out jobs

Home directories

Content management

Web serving

Metadata-intensive jobs

Page 21

How Monsanto Uses Amazon EFS with Large Scale Geospatial Data Sets

Page 22

About us

Vishnu Alavur Kannan, Analytics Technical Platforms Lead

https://www.linkedin.com/in/vishnukannan

• 15+ years in IT, software engineer @heart

• Led engineering teams throughout my career

• ‘A’ players make all the difference

• 50:1, 100:1, rarely in any other profession

• @Monsanto for two reasons:

• I believe in our commitment to sustainable agriculture

• I am able to do top-flight Engineering R&D

Stuart Wong, Platform Engineer (@cgswong)

https://www.linkedin.com/in/cgswong

• 15+ years in IT

• SysAdmin, DBA, team lead, Infrastructure Engineer

• Love all things technology and open source

• @Monsanto for two reasons:

• I believe in the mission

• Able to work with and learn from people much smarter than me

Page 23

Monsanto

A sustainable agriculture company

• Bringing a broad range of solutions to help nourish our growing world

• Collaborating to help tackle some of the world’s biggest challenges

• >20,000 employees in 66 countries

• >50% of employees based outside of the United States

• One of the 25 World’s Best Multinational Workplaces by Great Place to Work Institute

Page 24

What to Expect from the Session

• Some background

• Geospatial on EFS

• Analytics@scale on EFS

• Final thoughts

• Q&A

Page 25

Some Background

Embarked on a “cloud first” strategy in 2015. Specifically, refactor or build new applications/services in the cloud. We had:

• Legacy on-premises environment

• Scalability constraints

• Growth constraints

• Stability and performance challenges

• Proprietary applications

Page 26

Geospatial Make Over…

To move data and analytics to cloud we needed:

• Open source, standards based

• Scalable and performant

• Fault tolerant

• Secured, but easy to use

• Cost effective

Page 27

Geospatial Data Assets in R&D

Geospatial Catalog

Page 28

Initially…

Page 29

GeoServer Clustering

Problems

• No built-in clustering

• Manual setup process

• No shared state

• In-memory caching

Solution

• Clustering Extension

• Handles change detection and broadcasting

• Clears relevant caches

• Handles HTTP session sharing

• RDS for data directory (vector data)

• EFS for configuration and raster data

Page 30

Options we compared

Criterion      Database   BYO   Amazon EFS
Setup          3          2     4
Management     3          2     5
Scalability    3          4     5
Performance    3          5     4
Cost           3          3     4
Total          15         16    22

Score key: 1 Poor, 2 Bad, 3 Fair, 4 Good, 5 Excellent
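The totals in the comparison are plain unweighted sums of the five criterion scores. A small sketch that reproduces them, with the scores copied from the slide; the optional `weights` argument is a hypothetical extension for teams that value some criteria more than others and is not part of the original comparison.

```python
CRITERIA = ("Setup", "Management", "Scalability", "Performance", "Cost")

# Scores copied from the slide (1 Poor ... 5 Excellent).
SCORES = {
    "Database":   dict(zip(CRITERIA, (3, 3, 3, 3, 3))),
    "BYO":        dict(zip(CRITERIA, (2, 2, 4, 5, 3))),
    "Amazon EFS": dict(zip(CRITERIA, (4, 5, 5, 4, 4))),
}


def total_scores(scores, weights=None):
    """Sum each option's scores; weights (hypothetical) default to 1 per criterion."""
    weights = weights or {}
    return {
        option: sum(value * weights.get(criterion, 1)
                    for criterion, value in per_criterion.items())
        for option, per_criterion in scores.items()
    }


if __name__ == "__main__":
    totals = total_scores(SCORES)
    for option, total in sorted(totals.items(), key=lambda item: -item[1]):
        print(f"{option:<11} {total}")
```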

Page 31

Current…

Page 32

EFS Performance - Bytes and Credits

Page 33

EFS Performance – Limits and Total IO

Page 34

Analytics@scale on EFS

• Analytics as a Service

Collaborative Exploratory & Discovery Analytics Platform

• Production Eco-system

Use case: Environmental Classification

Page 35

Analytics as a Service

Collaborative Exploratory & Discovery Analytics Platform

Exploratory - Nonprime Discovery - Prime

Development Environments @SCALE

• Big-data DevOps

• Model Deployments @scale

• Big-data workloads

• Cloud Best practices

• Monitoring, Alerting…

• Security/ISO

Analytics @SCALE BLUE - GREEN

• Co-engineering

• Thinking scale ahead

• Model refactoring

• Infrastructure as code

• Distributed computing

• API & Streams

CPLEXDOC

• User sessions and backups

• EMR configurations and workloads

• Docker/ECS configurations and workloads

• R & Python Package stores

Page 36

Collaborative Exploratory & Discovery Analytics Platform

High level Architecture

via Ansible/CFN - AMI, Docker-based Instrumentation

Page 37

Exploratory Analytics Platform – Non-prime/Sandbox

via Ansible/CFN - AMI, Docker-based Instrumentation

Page 38

Discovery Analytics Platform – Prime

via Ansible/CFN - AMI, Docker-based Instrumentation

Page 39

Analytics as a Service: Exploratory and Discovery Analytics - Development Environments

Data Scientists, Developers and Novice Users

From Discovery to Production - Culture, approach & adoption

Know Your Users

For Community By Community

Tailor by Needs

Balance Freedom with Governance

Drive User Adoption

Environments iteratively served to everyone @monsanto

Enable analytical capabilities @scale for the enterprise integrated with Product Platforms

As of today, # of unique data scientists across groups utilizing our discovery analytics environments

Model maturity Global Scalability

Core teams : Train the trainee to share knowledge and best practices utilizing the environment

Page 40

Production Eco-system: Environmental Classification

Monsanto’s competitive advantage

By managing interactions between different zones, we can enable:

• Prescriptive recommendations

• Predictive Genotype x Environment interaction

• Accelerate Research Pipeline

• Advisory products

• Predict ranking of hybrid(s) and inbred(s) within and across environments

• Link Research to Manufacturing to Customer fields

[Figure: six "Hybrid rank" panels; three list the ranks in the order 1, 2, 3 and three in the order 3, 1, 2. Labels: Topography, Climate]

EC classes are regions in feature space

Monsanto Advisors: Right treatment in the right environment

Page 41

Environmental Classification Engine - Objectives

Identify discrete zones within a field based on sub-field environmental and macro-level weather factors and their relationship to a phenotype performance (e.g. yield)

Treatments developed and tested on mapped research fields

The best treatments applied to each sub-field environment in production fields

Data analytics find the best treatment for each sub-field environment

Page 42

Environmental Classification Engine @scale

[Diagram: Data Provisioning APIs, Data Transformation, QA/QC Rules, Scala, Python, Scikit, API, Amazon EMR]

From Gridding our fields to Gridding the Entire United States

Page 43

Environmental Classification Engine

EFS & EMR Performance

Page 44

DATA INGESTION AND TRANSFORMATION VIA APIs AND STREAMS

Streaming

Business Intelligence

Production Eco-system – RUN ANALYTICS@SCALE IN THE CLOUD

Collaborative Data Science – EXPLORATORY & DISCOVERY ANALYTICS PLATFORM

DATA DRIVEN PRODUCTS

KAFKA Streams, Data Warehouse*, Big-data

Model outputs via APIs & Streams

In-house/Third Party: Platforms AWS, GCP, Cloudera, DataStax, IBM, Azure, Domino labs…

Prescriptive, Predictive, Cognitive, Historical

Models - Deep Learning, Computational Pipelines, Classification & Simulation Engines

Turn Data into Actionable Insights

Page 45

Recommendations

• Keep it simple, follow AWS guidance

• Choose the appropriate mode

• Ensure tooling can read file system size

• Plan for redeployments

• Check AWS limits

• Plan backup/recovery early

• Know performance model
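On the "Ensure tooling can read file system size" point: an elastic file system has no fixed capacity, so a client typically sees a very large nominal total, and "percent full" checks stop being meaningful. A minimal sketch of a check that reports used bytes instead, using only the Python standard library; the mount path `/mnt/efs` is hypothetical, and how quickly the reported usage reflects recent writes is not something this sketch verifies.

```python
import shutil


def describe_mount(path):
    """Return total, used, and free bytes for the file system holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "free_bytes": usage.free,
    }


def used_gb(path):
    """Used space in decimal GB; prefer this over percent-full on elastic storage."""
    return describe_mount(path)["used_bytes"] / 1e9


if __name__ == "__main__":
    # Replace "." with the EFS mount point, for example the hypothetical /mnt/efs.
    info = describe_mount(".")
    print(f"used: {info['used_bytes'] / 1e9:.2f} GB of a nominal "
          f"{info['total_bytes'] / 1e9:.2f} GB")
```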

Page 46

Takeaways

• Simple setup

• No Management!

• Usage based performance

• Usage based cost

• Almost unlimited scale

• Remember it is just a shared filesystem

Page 47

Thank you!

Questions?

Page 48

Remember to complete your evaluations!