Monsanto Uses Amazon EFS for Large Scale Geospatial Data

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stuart Wong, Platform Engineer (carrington.wong@monsanto.com) - Monsanto
Vishnu Kannan, Analytics Technical Platform Lead (vvalav@monsanto.com) – Monsanto
Lex Crosett, AWS Enterprises Solutions Architect (crosettl@amazon.com)
December 2, 2016
Case Study
How Monsanto Uses Amazon EFS with
Large Scale Geospatial Data Sets

Goals and expectations for this session
• Overall goal: Introduce you to Amazon EFS (what it is, its features, how it can help you)
• The Monsanto team will describe their big data applications using EFS as a storage platform
• Session intended for all levels: We’ll cover both beginner topics and more advanced concepts
• We’ll do Q&A at the end

Agenda
1. Provide overview of EFS
2. Discuss EFS availability, scalability, durability, and security properties
3. Share key EFS performance characteristics
4. Review Monsanto case study

Batches and Streams:
• AWS Direct Connect
• AWS Snowball, Snowball Edge, Snowmobile
• 3rd Party Connectors
• Transfer Acceleration
• Amazon Storage Gateway
• Amazon Kinesis Firehose
File: Amazon EFS
Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)
Object: Amazon S3, Amazon Glacier

What is Amazon EFS?
• A fully managed file system for Amazon EC2 instances
• Exposes a file system interface that works with standard operating system APIs
• Provides file system access semantics (consistency, locking)
• Sharable across thousands of instances
• Designed to grow elastically to petabyte scale
• Built for performance across a wide variety of workloads
• Highly available and durable

Operating file storage on-prem today is a pain
IT administrator
• Estimate demand
• Procure hardware
• Set aside physical space
• Set up and maintain hardware (and network)
• Manage access and security
Application owner or developer
• Provide demand forecasts/business case
• Add lead times and extra coordination to your schedule
• Limit your flexibility and agility
Business owner
• Make up-front capital investments, over-buy, stay on a constant upgrade/refresh cycle
• Sacrifice business agility
• Distract your people from your business’s mission

Building your own on the cloud is too much work and is expensive
Replicate EBS volumes (1 per EC2 instance)
• Substantial management overhead (sync data, provision and manage volumes)
• Costly (one volume per instance)
Use a shared file layer
• Complex to set up and maintain
• Scale challenges
• Costly (compute + storage)

We focused on changing the game
1. Simple
2. Elastic
3. Scalable
Highly durable
Highly available

Amazon EFS is simple
Fully managed
- No hardware, network, file layer
- Create a scalable file system in seconds!
Seamless integration with existing tools and apps
- NFS v4.1: widespread, open
- Standard file system access semantics
- Works with standard OS file system APIs
Simple pricing = simple forecasting
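Because EFS is accessed over NFS v4.1, mounting it is an ordinary NFS mount. The sketch below only assembles the command string. The file system ID, region, and mount point are hypothetical, and the DNS name form and mount options are assumptions based on AWS's published guidance that should be checked against the current EFS documentation.

```python
def efs_mount_command(fs_id: str, region: str, mount_point: str) -> str:
    """Build an NFS v4.1 mount command for an EFS file system.

    Assumes the per-file-system DNS name form <fs-id>.efs.<region>.amazonaws.com.
    """
    dns_name = f"{fs_id}.efs.{region}.amazonaws.com"
    options = ",".join([
        "nfsvers=4.1",     # EFS speaks NFS v4.1
        "rsize=1048576",   # 1 MiB read buffer
        "wsize=1048576",   # 1 MiB write buffer
        "hard",            # retry indefinitely rather than returning I/O errors
        "timeo=600",       # 60 s timeout (value is in tenths of a second)
        "retrans=2",       # retries before further recovery action
    ])
    return f"sudo mount -t nfs4 -o {options} {dns_name}:/ {mount_point}"


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(efs_mount_command("fs-12345678", "us-west-2", "/mnt/efs"))
```

Existing applications then read and write under the mount point with standard file APIs; no EFS-specific client code is involved.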

Amazon EFS is elastic
File systems grow and shrink automatically as you add and remove files
No need to provision storage capacity or performance
You pay only for the storage space you use, with no minimum fee
EFS price: $0.30/GB-month
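A rough forecast follows directly from the per-GB price. The sketch below assumes the bill is based on the average amount stored over the month, uses the $0.30/GB-month figure from the slide, and takes a hypothetical list of daily storage samples as input.

```python
PRICE_PER_GB_MONTH = 0.30  # USD, figure quoted on the slide (2016)


def monthly_cost_usd(daily_stored_gb: list[float]) -> float:
    """Estimate a month's storage cost from daily size samples (in GB).

    Assumption: cost is proportional to the average GB stored over the month.
    """
    average_gb = sum(daily_stored_gb) / len(daily_stored_gb)
    return round(average_gb * PRICE_PER_GB_MONTH, 2)


if __name__ == "__main__":
    # A file system that holds 1,000 GB for only half of a 30-day month.
    print(monthly_cost_usd([0] * 15 + [1000] * 15))
```

Because the file system shrinks when files are deleted, removing data lowers the average and therefore the estimate; there is no provisioned capacity to pay for.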

Amazon EFS is scalable
File systems can grow to petabyte scale
Throughput and IOPS scale automatically as file systems grow
Consistent low latencies regardless of file system size
Support for thousands of concurrent NFS connections

Highly durable and highly available
Designed to sustain AZ offline conditions
Superior to traditional NAS availability models
Appropriate for production/tier 0 applications

Several security mechanisms
• Control network traffic to and from file systems (mount targets) by using VPC security groups and network ACLs
• Control file and directory access by using POSIX permissions
• Control administrative access (API access) to file systems by using AWS Identity and Access Management (IAM)
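For the network layer, the usual pattern is to allow NFS traffic (TCP port 2049) into the mount target's security group only from the security group of the client instances. The sketch below builds that rule in the shape accepted by the EC2 AuthorizeSecurityGroupIngress API; the security group IDs are hypothetical and no AWS call is made.

```python
NFS_PORT = 2049  # standard NFS port, used by EFS mount targets


def nfs_ingress_rule(client_security_group_id: str) -> dict:
    """Return an ingress rule allowing NFS from one client security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": NFS_PORT,
        "ToPort": NFS_PORT,
        # Reference the clients' security group instead of an IP range.
        "UserIdGroupPairs": [{"GroupId": client_security_group_id}],
    }


if __name__ == "__main__":
    rule = nfs_ingress_rule("sg-0123456789abcdef0")  # hypothetical client group
    print(rule)
    # With boto3 installed and credentials configured, the rule could be applied as:
    #   boto3.client("ec2").authorize_security_group_ingress(
    #       GroupId="<mount-target-security-group>", IpPermissions=[rule])
```

POSIX permissions and IAM policies are separate layers and are not covered by this rule.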

In which regions can I use EFS?
US West (Oregon)
US East (N. Virginia)
EU (Ireland)

Data is stored in multiple AZs for high availability and durability
Every file system object (directory, file, and link) is redundantly stored across multiple AZs in a region
[Figure: Amazon EFS spanning Availability Zones 1, 2, and 3 within a region]

EFS provides throughput that scales as a file system grows
As a file system gets larger, it needs access to more throughput
Many file workloads are spiky, with peak throughput well above average levels
Amazon EFS scalable bursting model is designed to make performance available when you need it

Bursting model examples
File system size                  Read/write throughput
A 1 TB EFS file system can…       • Drive up to 50 MB/s continuously, or
                                  • Burst to 100 MB/s for up to 12 hours each day*
A 10 TB EFS file system can…      • Drive up to 500 MB/s continuously, or
                                  • Burst to 1 GB/s for up to 12 hours each day*
A 100 GB EFS file system can…     • Drive up to 5 MB/s continuously, or
                                  • Burst to 100 MB/s for up to 72 minutes each day*
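The three examples are consistent with a simple credit model, sketched below under stated assumptions: a baseline of 50 MB/s per TB stored, a burst rate of 100 MB/s per TB with a floor of 100 MB/s, and burst credits earned at the baseline rate over a full day. Decimal units (1 TB = 1,000 GB) are used to match the slide's round numbers; the function names are illustrative.

```python
def baseline_mb_per_s(size_gb: float) -> float:
    """Continuous throughput: 50 MB/s per TB stored (assumed from the examples)."""
    return 50.0 * size_gb / 1000.0


def burst_mb_per_s(size_gb: float) -> float:
    """Burst throughput: 100 MB/s per TB, never below 100 MB/s (assumed)."""
    return max(100.0, 100.0 * size_gb / 1000.0)


def burst_minutes_per_day(size_gb: float) -> float:
    """Minutes of bursting per day if a full day of baseline credits is spent at burst rate."""
    minutes_per_day = 24 * 60
    return minutes_per_day * baseline_mb_per_s(size_gb) / burst_mb_per_s(size_gb)


if __name__ == "__main__":
    for size_gb in (100, 1000, 10000):
        print(size_gb, baseline_mb_per_s(size_gb), burst_mb_per_s(size_gb),
              burst_minutes_per_day(size_gb))
```

Under these assumptions a 100 GB file system gets 72 minutes of burst per day and the 1 TB and 10 TB file systems get 720 minutes (12 hours), matching the table.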

Amazon EFS is designed for a wide spectrum of use cases
High throughput and parallel I/O
• Genomics
• Big data analytics
• Scale-out jobs
Low latency and serial I/O
• Home directories
• Content management
• Web serving
• Metadata-intensive jobs

How Monsanto Uses Amazon
EFS with Large Scale
Geospatial Data Sets

About us
Vishnu Alavur Kannan, Analytics Technical Platforms Lead
https://www.linkedin.com/in/vishnukannan
• 15+ years in IT, software engineer @heart
• Led engineering teams throughout my career
• ‘A’ players make all the difference
  - 50:1, 100:1; rarely seen in any other profession
• @Monsanto for two reasons:
  - I believe in our commitment to sustainable agriculture
  - I am able to do top-flight engineering R&D
Stuart Wong, Platform Engineer (@cgswong)
https://www.linkedin.com/in/cgswong
• 15+ years in IT
• SysAdmin, DBA, team lead, Infrastructure Engineer
• Love all things technology and open source
• @Monsanto for two reasons:
  - I believe in the mission
  - Able to work with and learn from people much smarter than me

Monsanto
A sustainable agriculture company
• Bringing a broad range of solutions to help nourish our growing world
• Collaborating to help tackle some of the world’s biggest challenges
• >20,000 employees in 66 countries
• >50% of employees based outside the United States
• Named one of the 25 World’s Best Multinational Workplaces by the Great Place to Work Institute

What to Expect from the Session
• Some background
• Geospatial on EFS
• Analytics@scale on EFS
• Final thoughts
• Q&A

Some Background
Embarked on a “cloud first” strategy in 2015. Specifically: refactor existing applications, or build new applications/services, in the cloud. We had:
• Legacy on-premises environment
• Scalability constraints
• Growth constraints
• Stability and performance challenges
• Proprietary applications

Geospatial Makeover…
To move data and analytics to the cloud we needed:
• Open source, standards-based
• Scalable and performant
• Fault tolerant
• Secure, but easy to use
• Cost-effective

Geospatial Data Assets in R&D
[Figure: Geospatial Catalog]

GeoServer Clustering
Problems
• No built-in clustering
• Manual setup process
• No shared state
• In-memory caching
Solution
• Clustering Extension
• Handles change detection and broadcasting
• Clears relevant caches
• Handles HTTP session sharing
• RDS for data directory (vector data)
• EFS for configuration and raster data
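The change-detection and cache-clearing idea can be illustrated with a toy model. This is not the GeoServer clustering extension's code or API: the Node class is hypothetical, and a plain dictionary stands in for the configuration shared through EFS.

```python
class Node:
    """A cluster member with a local in-memory cache over a shared store."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {}
        self.peers = []

    def read(self, key, shared_store):
        # Serve from the local cache, falling back to the shared store.
        if key not in self.cache:
            self.cache[key] = shared_store[key]
        return self.cache[key]

    def write(self, key, value, shared_store):
        # Persist the change, then broadcast it so peers drop stale entries.
        shared_store[key] = value
        self.cache[key] = value
        for peer in self.peers:
            peer.invalidate(key)

    def invalidate(self, key):
        # Clear only the affected cache entry.
        self.cache.pop(key, None)


if __name__ == "__main__":
    shared = {"layer.style": "default"}  # stand-in for shared configuration
    a, b = Node("a"), Node("b")
    a.peers, b.peers = [b], [a]
    print(a.read("layer.style", shared))   # cached on node a
    b.write("layer.style", "contour", shared)
    print(a.read("layer.style", shared))   # re-read after invalidation
```

Without the broadcast step, node a would keep serving its cached value, which is the "no shared state, in-memory caching" problem listed above.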

Options we compared
               Database   BYO   Amazon EFS
Setup              3       2        4
Management         3       2        5
Scalability        3       4        5
Performance        3       5        4
Cost               3       3        4
Total             15      16       22
Score rating: 1 = Poor, 2 = Bad, 3 = Fair, 4 = Good, 5 = Excellent
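The totals are an unweighted sum of the five criteria. A small sketch that reproduces them from the scores on the slide (the equal weighting is how the slide's totals work out; any other weighting would be an assumption):

```python
SCORES = {
    "Database":   {"Setup": 3, "Management": 3, "Scalability": 3, "Performance": 3, "Cost": 3},
    "BYO":        {"Setup": 2, "Management": 2, "Scalability": 4, "Performance": 5, "Cost": 3},
    "Amazon EFS": {"Setup": 4, "Management": 5, "Scalability": 5, "Performance": 4, "Cost": 4},
}


def totals(scores: dict) -> dict:
    """Sum each option's criterion scores with equal weight."""
    return {option: sum(criteria.values()) for option, criteria in scores.items()}


if __name__ == "__main__":
    result = totals(SCORES)
    print(result)
    print("Highest total:", max(result, key=result.get))
```

Note that BYO scores highest on performance alone; EFS wins on the combined total.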

EFS Performance - Bytes and Credits

EFS Performance – Limits and Total IO
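Charts like these are typically drawn from the CloudWatch metrics EFS publishes in the AWS/EFS namespace, for example BurstCreditBalance, TotalIOBytes, PermittedThroughput, and PercentIOLimit. Which metrics the slides actually plotted is an assumption. The sketch below only builds the query parameters in the shape expected by CloudWatch's GetMetricStatistics; the file system ID is hypothetical and no AWS call is made.

```python
from datetime import datetime, timedelta, timezone


def efs_metric_query(file_system_id: str, metric_name: str,
                     hours: int = 24, period_seconds: int = 300) -> dict:
    """Build GetMetricStatistics parameters for one EFS metric."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EFS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


if __name__ == "__main__":
    query = efs_metric_query("fs-12345678", "BurstCreditBalance")  # hypothetical ID
    print(query)
    # With boto3 installed and credentials configured:
    #   boto3.client("cloudwatch").get_metric_statistics(**query)
```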

Analytics@scale on EFS
• Analytics as a Service
  - Collaborative Exploratory & Discovery Analytics Platform
• Production Eco-system
  - Use case: Environmental Classification

Analytics as a Service
Collaborative Exploratory & Discovery Analytics Platform
Exploratory - Nonprime Discovery - Prime
Development Environments @SCALE
• Big-data DevOps
• Model Deployments @scale
• Big-data workloads
• Cloud Best practices
• Monitoring, Alerting…
• Security/ISO
Analytics @SCALE BLUE - GREEN
• Co-engineering
• Thinking scale ahead
• Model refactoring
• Infrastructure as code
• Distributed computing
• API & Streams
CPLEXDOC
• User sessions and backups
• EMR configurations and workloads
• Docker/ECS configurations and workloads
• R & Python Package stores

Collaborative Exploratory & Discovery Analytics Platform
High level Architecture
via Ansible/CFN - AMI, Docker - based Instrumentation

Exploratory Analytics Platform – Non-prime/Sandbox

Discovery Analytics Platform – Prime

Analytics as a Service: Exploratory and Discovery Analytics - Development Environments
Data Scientists, Developers and Novice Users
From Discovery to Production - Culture, approach & adoption
• Know Your Users
• For Community, By Community
• Tailor by Needs
• Balance Freedom with Governance
• Drive User Adoption
Environments iteratively served to everyone @monsanto
Enable analytical capabilities @scale for the enterprise integrated with Product Platforms
As of today, # of unique data scientists across groups utilizing our discovery analytics environments
Model maturity
Global Scalability
Core teams: Train the trainee to share knowledge and best practices utilizing the environment

Production Eco-system: Environmental Classification
Monsanto’s competitive advantage
By managing interactions between different zones, we can enable:
• Prescriptive recommendations
• Predictive Genotype x Environment interaction
• Accelerate Research Pipeline
• Advisory products
• Predict ranking of hybrid(s) and inbred(s) within and across
environments
• Link Research to Manufacturing to Customer fields
[Figure: hybrid rank by environment. In three panels the hybrids rank 1, 2, 3; in three others they rank 3, 1, 2.]
EC classes are regions in feature space (topography, climate)
Monsanto Advisors: Right treatment in the right environment

Environmental Classification Engine - Objectives
• Identify discrete zones within a field based on sub-field environmental and macro-level weather factors and their relationship to phenotype performance (e.g. yield)
Treatments developed and tested on mapped research fields
Data analytics find the best treatment for each sub-field environment
The best treatments applied to each sub-field environment in production fields
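The idea of assigning each part of a field to an environmental class can be illustrated with a nearest-centroid rule in feature space. This is only an illustration, not Monsanto's engine: the class names, the two normalized features (elevation, rainfall), and all values are hypothetical.

```python
import math

# Hypothetical class centroids in a 2-D feature space: (elevation, rainfall),
# both normalized to the range 0..1.
CENTROIDS = {
    "EC-A": (0.2, 0.8),
    "EC-B": (0.9, 0.1),
}


def nearest_class(features: tuple, centroids: dict) -> str:
    """Return the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))


if __name__ == "__main__":
    # Hypothetical sub-field grid cells.
    grid_cells = {"cell-1": (0.25, 0.70), "cell-2": (0.80, 0.20)}
    for cell, features in grid_cells.items():
        print(cell, nearest_class(features, CENTROIDS))
```

Each grid cell is classified independently, which is what makes this kind of workload easy to spread across many workers reading from a shared file system.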

Environmental Classification Engine @scale
From Gridding our fields to Gridding the Entire United States
[Figure: components shown are Data Provisioning APIs, Data Transformation, QA/QC Rules, Scala, Python, Scikit, two API endpoints, and Amazon EMR]

Environmental Classification Engine
EFS & EMR Performance

DATA INGESTION AND TRANSFORMATION VIA APIs AND STREAMS
Streaming
Business Intelligence
Production Eco-system – RUN ANALYTICS@SCALE IN THE CLOUD
Collaborative Data Science – EXPLORATORY & DISCOVERY ANALYTICS PLATFORM
DATA DRIVEN PRODUCTS
KAFKA Streams, Data Warehouse*, Big-data
Model outputs via APIs & Streams
In-house/Third Party Platforms: AWS, GCP, Cloudera, DataStax, IBM, Azure, Domino labs…
Prescriptive, Predictive, Cognitive, Historical
Models - Deep Learning, Computational Pipelines, Classification & Simulation Engines
Turn Data into Actionable Insights

Recommendations
• Keep it simple, follow AWS guidance
• Choose the appropriate mode
• Ensure tooling can read file system size
• Plan for redeployments
• Check AWS limits
• Plan backup/recovery early
• Know performance model
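On the point about tooling reading file system size: an EFS mount typically reports a very large total capacity to tools such as df (on the order of 8 EiB), so monitoring scripts that assume terabyte-scale numbers or compute "percent full" can misbehave. That reading of the recommendation is an assumption. The sketch below uses only the Python standard library; the mount path in the comment is hypothetical.

```python
import shutil

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def format_size(num_bytes: int) -> str:
    """Format a byte count with binary units, up to EiB."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def describe_mount(path: str) -> str:
    """Report total and used space for the file system containing path."""
    usage = shutil.disk_usage(path)
    return f"{path}: total {format_size(usage.total)}, used {format_size(usage.used)}"


if __name__ == "__main__":
    # "/" is used so the example runs anywhere; on an instance with EFS mounted,
    # pass the mount point instead (for example a hypothetical /mnt/efs).
    print(describe_mount("/"))
```

For EFS the used figure is the meaningful one; alerting on used space as a fraction of the reported total is not useful.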

Takeaways
• Simple setup
• No Management!
• Usage based performance
• Usage based cost
• Almost unlimited scale
• Remember, it is just a shared filesystem :)

Remember to complete
your evaluations!
