This document describes how Monsanto uses Amazon EFS for large-scale geospatial data sets. It gives an overview of EFS and its key features, then details how Monsanto moved its geospatial data and analytics to the cloud on EFS, including setting up a GeoServer cluster. It also covers how Monsanto built a collaborative analytics platform and a production environmental classification engine that run analytics at scale on EFS and EMR. It concludes with recommendations for using EFS and key takeaways.
2. Goals and expectations for this session
Overall goal: Introduce you to Amazon EFS (what it is, its
features, how it can help you)
The Monsanto team will describe their big data applications
using EFS as a storage platform
Session intended for all levels: We’ll cover both beginner
topics and more advanced concepts
We’ll do Q&A at the end
3. Agenda
1. Provide overview of EFS
2. Discuss EFS availability, scalability, durability, and security
properties
3. Share key EFS performance characteristics
4. Review Monsanto case study
5. Batches and Streams
Data transfer: AWS Direct Connect; AWS Snowball, Snowball Edge, Snowmobile; 3rd Party Connectors; Transfer Acceleration; AWS Storage Gateway; Amazon Kinesis Firehose
File: Amazon EFS
Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)
Object: Amazon S3, Amazon Glacier
7. What is Amazon EFS?
A fully managed file system for Amazon EC2 instances
Exposes a file system interface that works with standard operating system APIs
Provides file system access semantics (consistency, locking)
Sharable across thousands of instances
Designed to grow elastically to petabyte scale
Built for performance across a wide variety of workloads
Highly available and durable
8. Operating file storage on-prem today is a pain
IT administrator
• Estimate demand
• Procure hardware
• Set aside physical space
• Set up and maintain hardware (and network)
• Manage access and security
Application owner or developer
• Provide demand forecasts/business case
• Add lead times and extra coordination to your schedule
• Limit your flexibility and agility
Business owner
• Make up-front capital investments, over-buy, stay on a constant upgrade/refresh cycle
• Sacrifice business agility
• Distract your people from your business’s mission
9. Building your own on the cloud is too much work and is expensive
Replicate EBS volumes (1 per EC2 instance)
• Substantial management overhead (sync data, provision and manage volumes)
• Costly (one volume per instance)
Use a shared file layer
• Complex to set up and maintain
• Scale challenges
• Costly (compute + storage)
10. We focused on changing the game
1. Simple
2. Elastic
3. Scalable
Highly durable
Highly available
11. Amazon EFS is simple
Fully managed
- No hardware, network, file layer
- Create a scalable file system in seconds!
Seamless integration with existing tools and apps
- NFS v4.1—widespread, open
- Standard file system access semantics
- Works with standard OS file system APIs
Simple pricing = simple forecasting
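Because EFS is accessed over NFS v4.1, mounting it uses the standard NFS client. The following is a minimal sketch that only assembles the mount command rather than running it. The file system ID, region, and mount point are hypothetical placeholders, and the DNS name pattern and mount options are assumptions based on the form AWS documents for EFS, so verify them against current AWS guidance.

```python
from typing import List


def efs_mount_command(file_system_id: str, region: str, mount_point: str) -> List[str]:
    """Build (but do not execute) an NFS v4.1 mount command for an EFS file system."""
    # Assumed DNS pattern: <file-system-id>.efs.<region>.amazonaws.com
    dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"
    # Assumed options: NFS v4.1 with 1 MiB read/write sizes and hard mounts.
    options = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
    return ["sudo", "mount", "-t", "nfs4", "-o", options, f"{dns_name}:/", mount_point]


if __name__ == "__main__":
    # "fs-12345678" and "/mnt/efs" are placeholders, not real resources.
    print(" ".join(efs_mount_command("fs-12345678", "us-west-2", "/mnt/efs")))
```

Returning the command as a list keeps it easy to inspect or pass to a process runner once the placeholders are replaced with real values.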
12. Amazon EFS is elastic
File systems grow and shrink automatically
as you add and remove files
No need to provision storage capacity or
performance
You pay only for the storage space you use,
with no minimum fee
EFS price: $0.30/GB-month
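With usage-based pricing, a storage forecast reduces to one multiplication. A minimal sketch, assuming the $0.30/GB-month figure quoted on this slide; actual pricing varies by region and over time, and the function name is illustrative.

```python
PRICE_PER_GB_MONTH = 0.30  # USD, figure quoted on the slide; treat as an assumption


def monthly_storage_cost(stored_gb: float, price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Estimate the monthly storage charge in USD for the given amount of stored data."""
    if stored_gb < 0:
        raise ValueError("stored_gb must be non-negative")
    # No minimum fee: an empty file system costs nothing.
    return round(stored_gb * price_per_gb_month, 2)


if __name__ == "__main__":
    for size_gb in (0, 100, 1024):
        print(f"{size_gb} GB -> ${monthly_storage_cost(size_gb):.2f} per month")
```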
13. Amazon EFS is scalable
File systems can grow to petabyte scale
Throughput and IOPS scale automatically as file systems grow
Consistent low latencies regardless of file system size
Support for thousands of concurrent NFS connections
14. Highly durable and highly available
Designed to sustain AZ offline conditions
Superior to traditional NAS availability models
Appropriate for production/tier 0 applications
15. Several security mechanisms
Control network traffic to and from file systems (mount
targets) by using VPC security groups and network ACLs
Control file and directory access by using POSIX
permissions
Control administrative access (API access) to file
systems by using AWS Identity and Access Management
(IAM)
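The POSIX permission layer works the same way on EFS as on any local file system. A minimal sketch using only the Python standard library; a temporary directory stands in for a directory on a mounted EFS file system, and the 0o750 mode is just an example policy, not a recommendation from the talk.

```python
import os
import stat
import tempfile


def restrict_to_owner_and_group(path: str) -> int:
    """Give the owner rwx and the group r-x, remove access for others; return the mode bits."""
    os.chmod(path, 0o750)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical stand-in for something like /mnt/efs/project on a real mount.
    with tempfile.TemporaryDirectory() as shared_root:
        project_dir = os.path.join(shared_root, "project")
        os.mkdir(project_dir)
        print(oct(restrict_to_owner_and_group(project_dir)))
```

Security groups and IAM cover the network and API layers; this sketch addresses only file and directory access.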
16. In which regions can I use EFS?
US West (Oregon)
US East (N. Virginia)
EU (Ireland)
17. Data is stored in multiple AZs for high availability
and durability
Every file system object (directory, file, and link) is redundantly stored across multiple AZs in a region
[Diagram: Amazon EFS spanning Availability Zones 1, 2, and 3 within a region]
18. EFS provides throughput that scales as a file system
grows
As a file system gets larger, it
needs access to more
throughput
Many file workloads are spiky,
with peak throughput well above
average levels
Amazon EFS scalable bursting model is designed to
make performance available when you need it
19. Bursting model examples
File system size and read/write throughput:
A 1 TB EFS file system can:
• Drive up to 50 MB/s continuously, or
• Burst to 100 MB/s for up to 12 hours each day*
A 10 TB EFS file system can:
• Drive up to 500 MB/s continuously, or
• Burst to 1 GB/s for up to 12 hours each day*
A 100 GB EFS file system can:
• Drive up to 5 MB/s continuously, or
• Burst to 100 MB/s for up to 72 minutes each day*
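The three examples above are consistent with a simple rule, sketched here under stated assumptions: a baseline of 50 MB/s per TB stored, a burst rate of 100 MB/s per TB with a floor of 100 MB/s, and daily burst time equal to 24 hours multiplied by baseline/burst (credits earned at the baseline rate and spent at the burst rate). These constants are inferred from the slide's numbers, not taken from an official specification, and decimal units (1 TB = 1000 GB) are assumed.

```python
from typing import NamedTuple

BASELINE_MBPS_PER_TB = 50.0   # inferred from the slide examples
BURST_MBPS_PER_TB = 100.0     # inferred from the slide examples
MIN_BURST_MBPS = 100.0        # inferred: the 100 GB example still bursts to 100 MB/s


class BurstProfile(NamedTuple):
    baseline_mbps: float
    burst_mbps: float
    burst_minutes_per_day: float


def burst_profile(size_gb: float) -> BurstProfile:
    """Estimate baseline rate, burst rate, and daily burst time for a file system size."""
    size_tb = size_gb / 1000.0
    baseline = BASELINE_MBPS_PER_TB * size_tb
    burst = max(MIN_BURST_MBPS, BURST_MBPS_PER_TB * size_tb)
    # Credits accrue at the baseline rate all day and drain at the burst rate.
    minutes = 24 * 60 * baseline / burst
    return BurstProfile(baseline, burst, minutes)


if __name__ == "__main__":
    for size_gb in (100, 1000, 10000):
        p = burst_profile(size_gb)
        print(f"{size_gb} GB: {p.baseline_mbps:g} MB/s baseline, "
              f"{p.burst_mbps:g} MB/s burst for {p.burst_minutes_per_day:g} min/day")
```

Under these assumptions a 100 GB file system gets 72 minutes of burst per day, and the 1 TB and 10 TB file systems each get 720 minutes (12 hours), matching the slide.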
20. Amazon EFS is designed for a wide spectrum of use cases
Spectrum: from high throughput and parallel I/O to low latency and serial I/O
Genomics
Big data analytics
Scale-out jobs
Home directories
Content management
Web serving
Metadata-intensive jobs
22. About us
Vishnu Alavur Kannan, Analytics Technical Platforms Lead
https://www.linkedin.com/in/vishnukannan
• 15+ years in IT, software engineer @heart
• Led engineering teams throughout my career
• ‘A’ players make all the difference
• 50:1, 100:1, rarely in any other profession
@Monsanto for two reasons:
• I believe in our commitment to sustainable agriculture
• I am able to do top-flight Engineering R&D
Stuart Wong, Platform Engineer (@cgswong)
https://www.linkedin.com/in/cgswong
• 15+ years in IT
• SysAdmin, DBA, team lead, Infrastructure Engineer
• Love all things technology and open source
• @Monsanto for two reasons:
• I believe in the mission
• Able to work with and learn from people much smarter than me
23. Monsanto
A sustainable agriculture company
• Bringing a broad range of solutions to help nourish our growing
world
• Collaborating to help tackle some of the world’s biggest
challenges
• >20,000 employees in 66 countries
• >50% employees based outside of
the United States
• One of the 25 World’s Best Multinational Workplaces by Great
Place to Work Institute
24. What to Expect from the Session
• Some background
• Geospatial on EFS
• Analytics@scale on EFS
• Final thoughts
• Q&A
25. Some Background
Embarked on a “cloud first” strategy in 2015. Specifically, refactor or
build new applications/services in the cloud. We had:
• Legacy on-premises environment
• Scalability constraints
• Growth constraints
• Stability and performance challenges
• Proprietary applications
26. Geospatial Make Over…
To move data and analytics to the cloud we needed:
• Open source, standards-based
• Scalable and performant
• Fault tolerant
• Secured, but easy to use
• Cost effective
34. Analytics@scale on EFS
• Analytics as a Service
Collaborative Exploratory & Discovery Analytics Platform
• Production Eco-system
Use case: Environmental Classification
35. Analytics as a Service
Collaborative Exploratory & Discovery Analytics Platform
Exploratory - Nonprime Discovery - Prime
Development Environments @SCALE
• Big-data DevOps
• Model Deployments @scale
• Big-data workloads
• Cloud Best practices
• Monitoring, Alerting…
• Security/ISO
Analytics @SCALE BLUE - GREEN
• Co-engineering
• Thinking scale ahead
• Model refactoring
• Infrastructure as code
• Distributed computing
• API & Streams
CPLEXDOC
• User sessions and backups
• EMR configurations and workloads
• Docker/ECS configurations and workloads
• R & Python Package stores
36. Collaborative Exploratory & Discovery Analytics Platform
High level Architecture
AMI- and Docker-based instrumentation via Ansible/CFN
39. Analytics as a Service: Exploratory and Discovery Analytics - Development Environments
Data Scientists, Developers and Novice Users
From Discovery to Production - Culture, approach & adoption
• Know Your Users
• For Community, By Community
• Tailor by Needs
• Balance Freedom with Governance
• Drive User Adoption
• Environments iteratively served to everyone @monsanto
Enable analytical capabilities @scale for the enterprise integrated with
Product Platforms
As of today, # of unique data scientists across groups utilizing our discovery analytics environments
Model maturity Global Scalability
Core teams : Train the trainee to share knowledge and best practices utilizing the environment
40. Production Eco-system: Environmental Classification
Monsanto’s competitive advantage
By managing interactions between different zones, we can enable:
• Prescriptive recommendations
• Predictive Genotype x Environment interaction
• Accelerate Research Pipeline
• Advisory products
• Predict ranking of hybrid(s) and inbred(s) within and across
environments
• Link Research to Manufacturing to Customer fields
[Figure: six “Hybrid rank” panels; three show the order 1, 2, 3 and three show the order 3, 1, 2]
EC classes are regions in feature space
Feature dimensions shown: Topography, Climate
Monsanto Advisors: Right treatment in the right environment
41. Environmental Classification Engine - Objectives
Identify discrete zones within a field based on sub-field environmental and macro-level weather factors and their relationship to a phenotype performance (e.g. yield)
• Treatments developed and tested on mapped research fields
• Data analytics find the best treatment for each sub-field environment
• The best treatments applied to each sub-field environment in production fields
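One way to picture assigning a sub-field location to a discrete environmental class is a nearest-centroid rule in feature space. This is an illustrative technique swapped in for explanation only; the talk does not say which classifier the engine uses. The class names, feature choices (elevation and seasonal rainfall), and centroid values below are all hypothetical.

```python
import math
from typing import Dict, Sequence

# Hypothetical class centroids in a 2-D feature space: (elevation_m, seasonal_rainfall_mm).
CENTROIDS: Dict[str, Sequence[float]] = {
    "EC-A": (200.0, 500.0),
    "EC-B": (350.0, 800.0),
}


def classify(features: Sequence[float], centroids: Dict[str, Sequence[float]] = CENTROIDS) -> str:
    """Return the name of the class whose centroid is closest (Euclidean) to the features."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))


if __name__ == "__main__":
    # Two made-up grid cells described by the same two features.
    for cell in [(210.0, 520.0), (340.0, 790.0)]:
        print(cell, "->", classify(cell))
```

In practice features on different scales would need normalizing before distances are compared; that step is omitted to keep the sketch short.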
42. Environmental Classification Engine @scale
From Gridding our fields to Gridding the Entire United States
[Diagram components: Data Provisioning APIs; Data Transformation; QA/QC Rules; Scala, Python, Scikit; APIs; Amazon EMR]
44. DATA INGESTION AND TRANSFORMATION VIA APIs AND STREAMS
Streaming
Business Intelligence
Production Eco-system – RUN ANALYTICS@SCALE IN THE CLOUD
Collaborative Data Science – EXPLORATORY & DISCOVERY ANALYTICS PLATFORM
DATA DRIVEN PRODUCTS
Kafka Streams, Data Warehouse*, Big-data
Model outputs via APIs & Streams
In-house/Third Party: Platforms
AWS, GCP, Cloudera, DataStax, IBM, Azure, Domino labs…
Prescriptive, Predictive, Cognitive, Historical
Models - Deep Learning, Computational Pipelines, Classification & Simulation Engines
Turn Data into Actionable Insights
45. Recommendations
• Keep it simple, follow AWS guidance
• Choose the appropriate mode
• Ensure tooling can read file system size
• Plan for redeployments
• Check AWS limits
• Plan backup/recovery early
• Know the performance model
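On the recommendation to ensure tooling can read file system size: an elastic file system has no fixed capacity, and an EFS mount typically reports a very large nominal total (commonly displayed as 8.0E by `df -h`), which is an assumption worth verifying in your own environment. A minimal sketch of a size reader that formats values up to the exbibyte range so such totals do not overflow or confuse monitoring scripts; the function names are illustrative.

```python
import shutil

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def format_bytes(num_bytes: int) -> str:
    """Format a byte count using binary units, up to EiB."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} {UNITS[-1]}"  # not reached; kept for clarity


def describe_capacity(path: str) -> str:
    """Report total and used space for the file system containing the given path."""
    usage = shutil.disk_usage(path)
    return f"total={format_bytes(usage.total)} used={format_bytes(usage.used)}"


if __name__ == "__main__":
    # "." stands in for an EFS mount point such as a hypothetical /mnt/efs.
    print(describe_capacity("."))
```

For capacity alerts on such a mount, the used figure is the meaningful one; percent-full thresholds based on the nominal total will never fire.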
46. Takeaways
• Simple setup
• No Management!
• Usage-based performance
• Usage-based cost
• Almost unlimited scale
• Remember it is just a shared filesystem