From Pivotal's Amit Gupta, July 9, 2015: a look at how the Cloud Foundry Diego project runs at scale, and what it took to get there. The talk walks through the Diego scheduler and the performance testing effort, and the tooling needed to ensure that Cloud Foundry can scale quickly and effortlessly.
To learn more, visit pivotal.io/platform-as-a-service/pivotal-cloud-foundry
3. Who’s this guy?
• Berkeley math grad school… dropout
• Rails consulting… deserter
• now I do BOSH, Cloud Foundry, Diego, etc.
4. Testing Diego Performance at Scale
• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps
6. Current Diego Architecture
What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications
7. Current Diego Architecture
Main components:
• etcd: ephemeral data store
• consul: service discovery
• receptor: Diego API
• nsync: syncs CC desired state with Diego
• route-emitter: syncs routes with the gorouter
• converger: health management & consistency
• garden: containerization
• rep: syncs garden actual state with Diego
• auctioneer: workload scheduling
8. Performance Testing Approach
• full end-to-end tests
• do a lot of stuff:
– is it correct, is it performant?
• kill a lot of stuff:
– is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*
11. Test Specifications
• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
– workload distribution?
– scheduling time distribution?
– running time distribution?
– success rate?
– growth rate?
• launch 10n, …, 400n-instance LRP:
– same questions…
12. Test Specifications
• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs
13. Test Specifications
• bring up n nodes in parallel
– from each node, push a apps in parallel
– from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1
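A minimal sketch of what one test-lab node's loop could look like, assuming a driver that simply shells out to the cf CLI; the app name scheme, payload path, and overall structure are illustrative, not the actual test harness:

```go
// One test-lab node: push a (~20) apps in parallel, repeat for r (=40) rounds.
// Assumes the cf CLI is already targeted and logged in.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"sync"
)

const (
	appsPerRound = 20 // "a" from the spec
	rounds       = 40 // "r" from the spec
)

func main() {
	for round := 0; round < rounds; round++ {
		var wg sync.WaitGroup
		for i := 0; i < appsPerRound; i++ {
			wg.Add(1)
			go func(i int) {
				defer wg.Done()
				name := fmt.Sprintf("perf-%d-%d", round, i)
				// cf push blocks until staging and start complete, so its wall-clock
				// time spans the whole push lifecycle being measured
				out, err := exec.Command("cf", "push", name, "-p", "./app-bits").CombinedOutput()
				if err != nil {
					log.Printf("push %s failed: %v\n%s", name, err, out)
				}
			}(i)
		}
		wg.Wait() // finish the round before starting the next
	}
}
```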
14. Test Specifications
• the pushed apps have varying characteristics:
– 1-4 instances
– 128M-1024M memory
– 1M-200M source code payload
– 1-20 log lines/second
– crash never vs. every 30 s
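As a rough illustration of that variation, each app's characteristics could be drawn from the ranges above; the type and function names here are invented for the sketch:

```go
// Randomize one app's characteristics within the ranges from the spec.
package main

import (
	"fmt"
	"math/rand"
)

type appSpec struct {
	Instances    int  // 1-4
	MemoryMB     int  // 128-1024
	PayloadMB    int  // 1-200 MB of source bits
	LogLinesPerS int  // 1-20
	Crashes      bool // crash never vs. every 30 s
}

func randomAppSpec() appSpec {
	return appSpec{
		Instances:    1 + rand.Intn(4),
		MemoryMB:     128 + rand.Intn(1024-128+1),
		PayloadMB:    1 + rand.Intn(200),
		LogLinesPerS: 1 + rand.Intn(20),
		Crashes:      rand.Float64() < 0.10, // ~10% of instances crash by design
	}
}

func main() {
	fmt.Printf("%+v\n", randomAppSpec())
}
```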
15. Test Specifications
• starting with n=1:
– app instances ≈ 1k
– instances/cell ≈ 100
– memory utilization across cells ≈ 90%
– app instances crashing (by-design) ≈ 10%
16. Test Specifications
• evaluate:
– workload distribution
– success rate of pushes
– success rate of app routability
– times for all the things in the push lifecycles
– crash recovery behaviour
– all the metrics!
17. Test Specifications
• kill 10% of cells
– watch metrics for recovery behaviour
• kill moar cells… and etcd
– does system handle excess load gracefully?
• revive everything with > bosh cck
– does system recover gracefully…
– with no further manual intervention?
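A rough sketch of driving that kill-and-recover sequence; the cell-killing step is IaaS-specific and elided, and only the bosh cck step comes from the slide:

```go
// Sketch of the recovery step: after killing cells (and etcd) out of band,
// let BOSH detect and repair the damaged VMs, then verify no manual fixes remain.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// 1. kill ~10% of cell VMs via the IaaS (not shown), watch recovery metrics
	// 2. kill more cells plus etcd, check that excess load is shed gracefully
	// 3. revive everything with bosh cloud-check, confirm the system recovers on its own
	out, err := exec.Command("bosh", "cck").CombinedOutput()
	if err != nil {
		log.Fatalf("bosh cck failed: %v\n%s", err, out)
	}
	log.Printf("bosh cck output:\n%s", out)
}
```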
27. Results
Questions from the 400-task run driven by “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?
28. Results
• only 3-4 (out of 10) API nodes handle reqs?
– when multiple requests trigger a DNS lookup for the same hostname at the
same time, Golang hands the same DNS response back to all of them; as a result
only 3-4 distinct API endpoints were resolved for the whole set of tasks
• recording task reqs take increasing time?
– the API servers use an etcd client that throttles the number of concurrent
requests, so writes queue up behind the limit
• submitting auction reqs sometimes slow?
– auction requests require the API node to look up the auctioneer's address
in etcd, via the same throttled etcd client
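The two etcd-related answers share one cause: a client with a fixed number of in-flight slots makes later requests queue longer and longer as load grows. A toy illustration of that pattern, not Diego's actual etcd client:

```go
// Demonstrates why a concurrency-throttled client shows "increasing time" under load:
// with 2 in-flight slots and 6 concurrent requests, each later request waits behind
// the earlier ones.
package main

import (
	"fmt"
	"time"
)

type throttledClient struct {
	sem chan struct{} // capacity = max in-flight requests
}

func newThrottledClient(maxInFlight int) *throttledClient {
	return &throttledClient{sem: make(chan struct{}, maxInFlight)}
}

func (c *throttledClient) do(req func()) {
	c.sem <- struct{}{}        // blocks once maxInFlight requests are outstanding
	defer func() { <-c.sem }() // free the slot when the request finishes
	req()
}

func main() {
	c := newThrottledClient(2)
	start := time.Now()
	done := make(chan struct{})
	for i := 0; i < 6; i++ {
		go func(i int) {
			c.do(func() { time.Sleep(50 * time.Millisecond) }) // simulated etcd write
			fmt.Printf("request %d finished after %v\n", i, time.Since(start))
			done <- struct{}{}
		}(i)
	}
	for i := 0; i < 6; i++ {
		<-done
	}
}
```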
29. Results
• later auctions take so long?
– reps were taking longer to report their state to auctioneer,
because they were making expensive calls to garden,
sequentially, to determine current resource usage
• outliers wtf?
– a combination of logs dropped by papertrail (lossiness) and cicerone
handling the missing data poorly
• container creation takes increasing time?
– garden team tasked with investigation
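The sequential-garden-calls problem (the “easy fix” mentioned on a later slide) comes down to fanning the per-container lookups out concurrently instead of one at a time. A sketch of that shape, using placeholder types rather than the real garden client API:

```go
// Collect per-container usage concurrently instead of serially; gardenMetrics and
// containerUsage are stand-ins for the real garden client calls and types.
package main

import "sync"

type containerUsage struct {
	Handle string
	DiskMB int
	MemMB  int
}

// gardenMetrics represents the expensive per-container call to garden.
func gardenMetrics(handle string) containerUsage {
	return containerUsage{Handle: handle}
}

func collectUsage(handles []string) []containerUsage {
	usages := make([]containerUsage, len(handles))
	var wg sync.WaitGroup
	for i, h := range handles {
		wg.Add(1)
		go func(i int, h string) {
			defer wg.Done()
			usages[i] = gardenMetrics(h) // each slow call now overlaps with the others
		}(i, h)
	}
	wg.Wait()
	return usages
}

func main() {
	_ = collectUsage([]string{"container-1", "container-2", "container-3"})
}
```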
30. Results
Problems can come from:
• our software
– throttled etcd client
– sequential calls to garden
• software we consume
– garden container creation
• “experiment apparatus” (tools and services):
– papertrail lossiness
– cicerone sloppiness
• language runtime
– Golang’s DNS behaviour
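For the DNS item, one possible mitigation, shown here purely as an illustration and not necessarily the change the Diego team made, is to resolve every A record for the API hostname and dial a randomly chosen address per request; the hostname below is hypothetical:

```go
// Spread requests across all resolved API addresses instead of reusing the
// single answer a shared DNS lookup happened to return.
package main

import (
	"fmt"
	"math/rand"
	"net"
)

func pickAPIEndpoint(host string) (string, error) {
	addrs, err := net.LookupHost(host) // all A records, e.g. the 10 API nodes
	if err != nil {
		return "", err
	}
	return addrs[rand.Intn(len(addrs))], nil // pick one at random per request
}

func main() {
	addr, err := pickAPIEndpoint("receptor.example.com") // hypothetical hostname
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("using API node", addr)
}
```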
34. Results
• for the fastest pushes
– dominated by red, blue, gold
– i.e. upload source & CC emit “start”, staging process,
upload droplet
• pushes get slower
– growth in green, light blue, fuchsia, teal
– i.e. schedule staging, create staging container,
schedule running, create running container
• main concern: why is scheduling slowing down?
35. Results
• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
– with chattiness turned on
– with chattiness turned off
• with chattiness off, it appeared to work better
• tried it on AWS
• no improvement
36. Results
• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
– auctioneer asks cells for state
– cell reps ask garden for usage
– garden's per-container disk usage lookups are the bottleneck
41. Results
• cells heartbeat their presence to etcd
• if the TTL expires, the converger reschedules the cell's LRPs
• cells may reappear after their workloads have
been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hoping to catch it in the n=2 round
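The presence mechanism above amounts to a TTL'd key in etcd that the cell keeps refreshing; if the heartbeat stalls long enough for the TTL to lapse, the cell "disappears" and the converger reschedules its work. A sketch against etcd's v2 HTTP API, with an illustrative key path, TTL, and etcd address:

```go
// Keep a cell's presence key alive in etcd by re-PUTting it with a TTL;
// if refreshes stop, the key expires and the converger treats the cell as gone.
package main

import (
	"log"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

func heartbeat(etcdURL, cellID string, ttl time.Duration) {
	key := etcdURL + "/v2/keys/cells/" + cellID
	body := url.Values{
		"value": {cellID},
		"ttl":   {strconv.Itoa(int(ttl.Seconds()))},
	}.Encode()

	for range time.Tick(ttl / 2) { // refresh well before the TTL lapses
		req, _ := http.NewRequest("PUT", key, strings.NewReader(body))
		req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("heartbeat failed: %v", err) // repeated misses => cell "disappears"
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	heartbeat("http://127.0.0.1:4001", "cell-42", 10*time.Second)
}
```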
42. Results
With the one lingering question about cell disappearance, on to n=2
[slide diagram: test phases #1-#4 repeated at 1x, 2x, 5x, and 10x scale; the early rounds are checked off, the largest is still open]
45. Results
• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we
pushed 2x the apps
– things worked correctly
– system was performant as a whole
– but individual components showed signs of scaling issues
47. Results
• nsync fetches state from CC and etcd to make
sure CC desired state is reflected in Diego
• converger fetches desired and actual state
from etcd to make sure things are consistent
• route-emitter fetches state from etcd to keep
gorouter in sync
• bulk loop times doubled from n=1
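All three components follow the same fetch-compare-reconcile "bulk loop" shape, which is why their loop times grow with the amount of desired and actual state. A generic sketch of that shape with placeholder functions, not the real component code:

```go
// The bulk-loop pattern: on a timer, fetch desired and actual state, reconcile
// the difference, and record how long the pass took (the metric that doubled at n=2).
package main

import (
	"log"
	"time"
)

type state map[string]int // e.g. process GUID -> instance count (illustrative)

func fetchDesired() state { return state{} } // e.g. from CC or etcd
func fetchActual() state  { return state{} } // e.g. from etcd

func reconcile(desired, actual state) {
	// start, stop, or re-route whatever differs between the two views
}

func main() {
	const interval = 30 * time.Second // illustrative bulk-loop period
	for range time.Tick(interval) {
		start := time.Now()
		reconcile(fetchDesired(), fetchActual())
		log.Printf("bulk loop took %v", time.Since(start))
	}
}
```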
54. Updates on .NET Support
• what’s currently supported?
– ASP.NET MVC
– nothing too exotic
– most CF/Diego features, e.g. security groups
– VisualStudio plugin, similar to the Eclipse CF plugin for
Java
• what are the limitations?
– some newer Diego features, e.g. SSH
– in α/β stage, dev-only
55. Updates on .NET Support
• what’s coming up?
– make it easier to deploy Windows cells
– more VisualStudio plugin features
– hardening testing/CI
• further down the line?
– remote debugging
– the “Spring experience”
56. Updates on .NET Support
• shout outs
– CenturyLink
– HP
• feedback & questions?
– Mark Kropf (PM): mkropf@pivotal.io
– David Morhovich (Lead): dmorhovich@pivotal.io