A presentation from Networkshop48 by Tony Wildish, cloud bioinformatics application architect, European Bioinformatics Institute
EBI has recently begun a more systematic push into using commercial clouds. This presentation looks at the initial steps taken to migrate workflows into AWS and GCP, discusses some of the problems we have encountered along the way, and describes our plans going forward.
Migrating EBI into the cloud - lessons learned, so far
2. Migrating EBI into the cloud
Lessons learned… so far
Tony Wildish
Cloud Bioinformatics Lead Architect
wildish@ebi.ac.uk
3. EBI and the cloud
• A few EBI teams have made forays into the cloud
• End-of-grant money, PoC prototypes, small services
• Nothing big, sustained…
• Want to take bigger steps into the cloud, but how…?
• This talk: some of the lessons learned so far…
4. Why migrate EBI into the cloud?
[Chart: EBI data volume on a log scale (1 GB / 1 TB / 1 PB), 2004–2019]
• Data growing exponentially
• Doubling every ~2 years
• No sign of slowing down
• Exotic hardware
• High-memory machines, GPUs
• Hard to utilize well in an HPC environment
• Expensive to buy
5. EBI workloads
• Web services
• Upload file, wait for processing, browse results
• Not particularly demanding => not particularly interesting (to me!)
• Batch processing
• N files in, crunch, M files out
• Often periodic, e.g. triggered by upstream release of new version of source data
• Can be very CPU/data intensive (e.g. metagenome assembly: weeks-long runtimes)
• Dozens of different pipelines, in different groups, in different languages
• Varying historical legacy, code quality…
6. Workflows, HPC vs. cloud-native
[Diagram: three identical HPC workflows side by side, each running Step 1 → Step 2 → … → Step N end-to-end on its own resources]
Typical HPC workflow:
• Workflow-oriented
• Inefficient use of resources
• Exotic hardware (GPUs…) oversubscribed, underutilized
• Hard to scale up
[Diagram: cloud-native dataflow — Step 1 → queue → many parallel Step 2 workers → queue → Step N, with a shared FS, object store, or database holding intermediate data]
Cloud-native dataflow (sketched in code below):
• Dataflow-oriented
• Efficient per-step resource use
• Easy to scale up/out
• Portable to multi/hybrid-cloud
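To make the dataflow pattern concrete, here is a minimal, illustrative sketch in Python. It uses an in-process queue.Queue and a dict as stand-ins for a real cloud queue (e.g. SQS or Pub/Sub) and a shared object store; the step names and the 'crunch' operation are invented for illustration, not taken from an EBI pipeline.

```python
# Minimal sketch of a queue-based dataflow. In production the queues
# would be a managed service (SQS, Pub/Sub) and the object_store would
# be S3/GCS or a shared filesystem; here both are in-process stand-ins.
import queue
import threading

step1_out = queue.Queue()   # queue between Step 1 and Step 2
step2_out = queue.Queue()   # queue between Step 2 and Step N
object_store = {}           # stand-in for the shared object store

def step1(inputs):
    """Step 1: stage raw inputs and announce each one on the queue."""
    for name, data in inputs.items():
        object_store[name] = data
        step1_out.put(name)

def step2_worker():
    """Step 2: scaled out - run as many of these workers as needed."""
    while True:
        key = step1_out.get()
        if key is None:      # poison pill -> shut this worker down
            break
        result_key = key + ".crunched"
        object_store[result_key] = object_store[key].upper()  # 'crunch'
        step2_out.put(result_key)

# Scale Step 2 out: each worker needs only the resources for its own step.
workers = [threading.Thread(target=step2_worker) for _ in range(4)]
for w in workers:
    w.start()

step1({"sample1": "acgt", "sample2": "ttga"})
for _ in workers:
    step1_out.put(None)
for w in workers:
    w.join()

print(sorted(k for k in object_store if k.endswith(".crunched")))
```

Because each step only talks to the queues and the shared store, scaling Step 2 is just a matter of changing the worker count (or letting an autoscaler do it), independently of every other step.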
7. Cost optimization: Data vs. Compute
• Compute is ‘easy’
• Many options for cost optimization
• Reserved instances, spot markets, sustained use discounts
• VM -> containers -> serverless functions
• Many monitoring/advisory tools
• Data is harder
• Can’t ‘turn it off’ to save money
• Tiered storage (‘cold archive’) doesn’t help against exponentially growing data (see the sketch below)
• With a ~2-year doubling time, half our data is always < 2 years old, so still active
• Harder to estimate costs of data movement (ingress, egress)
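A back-of-envelope sketch of that point, with invented prices (the rates below are assumptions for illustration, not real AWS/GCP prices): even with the older half of the data parked in a cold tier, a ~2-year doubling time means the monthly bill still doubles every ~2 years.

```python
# Why cold tiers don't rescue exponentially growing data.
# Prices are purely illustrative assumptions, not real cloud rates.
HOT_PRICE = 0.020     # $/GB/month, assumed 'hot' tier
COLD_PRICE = 0.004    # $/GB/month, assumed 'cold' tier
DOUBLING_YEARS = 2    # data doubles every ~2 years

data_gb = 1_000_000   # say 1 PB today
for year in range(0, 9, 2):
    total = data_gb * 2 ** (year / DOUBLING_YEARS)
    hot = total / 2   # half the data is always < 2 years old...
    cold = total / 2  # ...only the older half can go cold
    monthly = hot * HOT_PRICE + cold * COLD_PRICE
    print(f"year {year}: {total / 1e6:.1f} PB, ~${monthly:,.0f}/month")
```

The tier discount changes the constant, not the exponent: the cost curve keeps the same doubling time as the data.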
8. Culture
• ‘Everyone’ agrees it’s a good idea to use cloud
• Few people have the time, knowledge or experience to do it well
• People concerned about:
• Spending ‘real money’ – keep getting asked for ‘cloud credits’
• Re-writing legacy pipelines – no, it’s not going to get easier if you wait
• Maintaining two pipeline versions, one for in-house, one for cloud
• ‘Cloud-bursting’ from on-prem is not a trivial way to use cloud
9. Knowledge, support, expertise
• In-house training program developed last year
• Users need new skills to be able to use cloud well
• Need systematic approach to spreading those skills through EBI
• 1-day program of the basics: Docker/GitLab/K8s with exercises
• Given to ~200 people now - https://bit.ly/resops-2019
• Support/expertise
• ‘Cloud Consultants’ team
• Consultation, PoC, embed in teams for larger projects
• Management of organization-level infrastructure: billing, IAM, security policies
10. Porting pipelines
• Lift-and-shift:
• A cluster in the cloud that looks like the on-premises system
• ✅ Relatively easy to do, lowers the barrier to entry for users
• ✅ It gets people used to the idea of cloud, lets them start exploring
• ✅ Stepping stone to better ways of doing things
• ❌ Hard to be cost-effective, doesn’t exploit cloud capabilities well
• ❌ Especially true for pipelines that assume large POSIX filesystems (see the sketch after this list)
• ❌ Hard to learn anything useful
• ❌ Hard to maintain momentum towards (cost-)efficient solutions
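As a sketch of what moving off the POSIX assumption looks like, here is the same step written in both styles. The bucket and key names are hypothetical, the 'crunch' function is a placeholder, and the boto3 usage assumes valid AWS credentials; this illustrates the pattern, not EBI's actual pipeline code.

```python
# Same pipeline step, POSIX-style vs object-store-style.
import boto3

s3 = boto3.client("s3")

def crunch(text: str) -> str:
    return text.upper()  # placeholder for the real per-step work

def posix_step(in_path, out_path):
    # Lift-and-shift style: assumes a large shared POSIX filesystem,
    # so every node must mount the same (expensive) storage.
    with open(in_path) as f:
        result = crunch(f.read())
    with open(out_path, "w") as f:
        f.write(result)

def cloud_step(bucket, in_key, out_key):
    # Cloud-native style: read/write the object store directly, so the
    # step can run on any stateless, ephemeral (e.g. spot) node.
    body = s3.get_object(Bucket=bucket, Key=in_key)["Body"].read()
    result = crunch(body.decode())
    s3.put_object(Bucket=bucket, Key=out_key, Body=result.encode())

# cloud_step("my-pipeline-bucket", "raw/sample1.txt", "out/sample1.txt")
```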
11. Scaling up
• Small deployments don’t teach you much
• Need scale, longevity to start seeing cost benefits
• Reserved instances, spot markets, tiered storage etc
• Understanding/controlling costs is a long-term process
• Iterate frequently with owners of deployments
• A cost-tuning process is more important than getting it right first time
• Too many variables to predict
• Establish a culture of review, oversight
12. Access, Accounting, Authorisation
• Who created/uses/used what resource?
• And which group/team they’re in, to aggregate by organization hierarchy
• Who needs what rights?
• E.g. group-level priorities within the organization; different groups have different needs
• Account/resource management when people start/leave work
• Not trivial in an academic environment, cf. a startup/devops shop
• => need an API for the structure of your organization
• Automate generating the corresponding structures in the cloud (see the sketch below)
• Update/verify changes in the cloud against the source of truth
• Hard to do hybrid/multi-cloud smoothly without it
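A minimal sketch of what such automation might look like, assuming a hypothetical internal org-structure API and AWS IAM as the target. The endpoint URL and its JSON shape are invented; the boto3 IAM calls are real, but a production version would need pagination, error handling, and handling of users who don't yet exist in IAM.

```python
# Sync an organisational source-of-truth into cloud IAM groups.
# The org-api endpoint and its {team: [usernames]} shape are hypothetical.
import boto3
import requests

iam = boto3.client("iam")

def desired_state():
    # Hypothetical internal API returning {"team-name": ["user1", ...]}.
    return requests.get("https://org-api.example.org/teams").json()

def sync_team(team, members):
    try:
        iam.create_group(GroupName=team)
    except iam.exceptions.EntityAlreadyExistsException:
        pass  # group already present in the cloud
    current = {u["UserName"] for u in iam.get_group(GroupName=team)["Users"]}
    for user in set(members) - current:    # joiners: grant access
        iam.add_user_to_group(GroupName=team, UserName=user)
    for user in current - set(members):    # leavers: revoke access
        iam.remove_user_from_group(GroupName=team, UserName=user)

for team, members in desired_state().items():
    sync_team(team, members)
```

Run periodically (or on HR events), this keeps cloud group membership reconciled with the organisation as people start and leave, which is exactly the hard part flagged above.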
13. Summary
• Migrating to cloud: an opportunity for cultural change?
• Training is important, as is a centre of expertise to help keep momentum
• Interest and willingness don’t always translate into sustained effort
• Pick your use cases carefully
• Not every use case will teach you something useful
• Simply reproducing on-prem systems in the cloud is not efficient/cost-effective
• Cloud IAM/policies/accounting need to be seamlessly linked to your organization
• Ad-hoc legacy infrastructure in the cloud is every bit as bad as on-premises
• Build dataflows, not workflows
• Data management is the key to optimizing workflows, reducing costs, scaling up
14.
“When you invent the ship, you also invent the shipwreck; when you invent the plane you also invent the plane crash; and when you invent electricity, you invent electrocution...”
Paul Virilio