How does the Cloud Foundry
Diego Project Run at Scale?
and updates on .NET Support
Who’s this guy?
• Amit Gupta
• https://akgupta.ca
• @amitkgupta84
Who’s this guy?
• Berkeley math grad school… dropout
• Rails consulting… deserter
• now I do BOSH, Cloud Foundry, Diego, etc.
Testing Diego Performance at Scale
• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps
Current Diego Architecture
Current Diego Architecture
What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications
Current Diego Architecture
Main components:
• etcd ephemeral data store
• consul service discovery
• receptor Diego API
• nsync sync CC desired state w/Diego
• route-emitter sync with gorouter
• converger health mgmt & consistency
• garden containerization
• rep sync garden actual state w/Diego
• auctioneer workload scheduling
Performance Testing Approach
• full end-to-end tests
• do a lot of stuff:
– is it correct, is it performant?
• kill a lot of stuff:
– is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*
Test Specifications
[Figure: four experiments, #1 to #4, each run at scales x 1, x 2, x 5 and x 10 of n]
Test Specifications
• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
– workload distribution?
– scheduling time distribution?
– running time distribution?
– success rate?
– growth rate?
• launch 10n, …, 400n-instance LRP:
– same questions…
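The questions above can be answered from per-task records. A minimal sketch, assuming a hypothetical record shape (`TaskRecord` and its fields are illustrative, not Diego's log schema):

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRecord:
    # Hypothetical record shape, not Diego's log schema.
    cell: str          # cell that ran the task
    requested: float   # time the task was requested (seconds)
    started: float     # time the task started running
    finished: float    # time the task completed
    succeeded: bool


def summarize(records):
    """Workload distribution, median request-to-start and running time, success rate."""
    per_cell = Counter(r.cell for r in records)
    time_to_start = [r.started - r.requested for r in records]
    running = [r.finished - r.started for r in records]
    return {
        "tasks_per_cell": dict(per_cell),
        "median_time_to_start_s": median(time_to_start),
        "median_running_s": median(running),
        "success_rate": sum(r.succeeded for r in records) / len(records),
    }
```

Running the same summary at 10n, …, 400n tasks and comparing the medians gives a rough view of the growth rate.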
Test Specifications
• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs
Test Specifications
• bring up n nodes in parallel
– from each node, push a apps in parallel
– from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1
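As a rough piece of arithmetic, the number of apps pushed per experiment is n × a × r. The sketch below assumes a is exactly 20 (the slide says ≈ 20) and counts apps, not app instances:

```python
def total_pushes(n, a=20, r=40):
    """Apps pushed in one experiment: n nodes x a parallel pushes x r rounds."""
    return n * a * r


# Scales used in the experiments: n = 1, 2, 5, 10.
for n in (1, 2, 5, 10):
    print(f"n={n}: about {total_pushes(n)} pushes")
```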
Test Specifications
• the pushed apps have varying characteristics:
– 1-4 instances
– 128M-1024M memory
– 1M-200M source code payload
– 1-20 log lines/second
– crash never vs. every 30 s
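A minimal sketch of generating such a mix of apps. The `AppSpec` type, the uniform sampling, and the 10% crash fraction applied per app are assumptions for illustration; the slides give only the ranges:

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppSpec:
    instances: int                 # 1-4
    memory_mb: int                 # 128-1024
    payload_mb: int                # 1-200
    log_lines_per_s: int           # 1-20
    crash_every_s: Optional[int]   # None = never crashes, 30 = crashes every 30 s


def random_app(rng, crash_fraction=0.1):
    """Draw one app spec uniformly from the ranges above (assumed distribution)."""
    return AppSpec(
        instances=rng.randint(1, 4),
        memory_mb=rng.randint(128, 1024),
        payload_mb=rng.randint(1, 200),
        log_lines_per_s=rng.randint(1, 20),
        crash_every_s=30 if rng.random() < crash_fraction else None,
    )


rng = random.Random(42)  # fixed seed so a test run is repeatable
apps = [random_app(rng) for _ in range(20)]
```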
Test Specifications
• starting with n=1:
– app instances ≈ 1k
– instances/cell ≈ 100
– memory utilization across cells ≈ 90%
– app instances crashing (by-design) ≈ 10%
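These targets imply a cell count. A small sketch of the arithmetic, treating the approximate figures above as exact:

```python
import math


def cells_needed(app_instances, instances_per_cell=100):
    """Cells required to hold app_instances at the target density."""
    return math.ceil(app_instances / instances_per_cell)


def crashing_instances(app_instances, crash_fraction=0.10):
    """Instances expected to crash by design at the given fraction."""
    return round(app_instances * crash_fraction)


print(cells_needed(1000), crashing_instances(1000))
```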
Test Specifications
• evaluate:
– workload distribution
– success rate of pushes
– success rate of app routability
– times for all the things in the push lifecycles
– crash recovery behaviour
– all the metrics!
Test Specifications
• kill 10% of cells
– watch metrics for recovery behaviour
• kill moar cells… and etcd
– does system handle excess load gracefully?
• revive everything with > bosh cck
– does system recover gracefully…
– with no further manual intervention?
Test Specifications
– Figure Out What’s Broke –
– Fix Stuff –
– Move On Scale Up & Repeat –
Test Implementation and Tools
• S3 log, graph, plot backups
• ginkgo & gomega testing DSL
• BOSH parallel test-lab deploys
• tmux & ssh run test suites remotely
• papertrail log archives
• datadog metrics visualizations
• cicerone (custom) log visualizations
Results
400 tasks’ lifecycle timelines, dominated by container creation
Results
Maybe some cells’ gardens were running slower?
Results
Grouping by cell shows uniform container creation slowdown
Results
So that’s not it…
Also, what’s with the blue steps?
Let’s visualize logs a couple more ways
Then take stock of the questions raised
Results
Let’s just look at scheduling (ignore container creation, etc.)
Results
Scheduling again, grouped by which API node handled the request
Results
And how about some histograms of all the things?
Results
From the 400-task request from “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?
Results
• only 3-4 (out of 10) API nodes handle reqs?
– when multiple requests look up the same address concurrently, Golang
performs a single DNS lookup and returns its response to all of them; this
results in only 3-4 API endpoint lookups for the whole set of tasks
• recording task reqs take increasing time?
– API servers use an etcd client with throttling on # of concurrent
requests
• submitting auction reqs sometimes slow?
– auction requests require the API node to look up the auctioneer
address in etcd, using the throttled etcd client
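Why a concurrency-throttled client makes later requests look slower can be seen in a toy model. This is not the etcd client's code; the limit and the per-request time are made-up parameters, and all requests are assumed to arrive at once and take equal time:

```python
import math


def completion_times(num_requests, max_concurrent, seconds_per_request):
    """Finish time of each request when at most max_concurrent run at once.

    Requests are served in waves of max_concurrent; request i (0-based)
    finishes at the end of wave ceil((i + 1) / max_concurrent).
    """
    return [
        math.ceil((i + 1) / max_concurrent) * seconds_per_request
        for i in range(num_requests)
    ]


# 400 simultaneous requests, 10 allowed in flight, 0.5 s each (illustrative numbers).
times = completion_times(400, 10, 0.5)
print(times[0], times[-1])
```

In this model the observed latency grows linearly with a request's position in the queue, which matches the shape of "takes increasing time" without any single request getting slower.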
Results
• later auctions take so long?
– reps were taking longer to report their state to auctioneer,
because they were making expensive calls to garden,
sequentially, to determine current resource usage
• outliers wtf?
– combination of missing logs due to papertrail lossiness, +
cicerone handling missing data poorly
• container creation takes increasing time?
– garden team tasked with investigation
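The cost of sequential calls can be illustrated with a stand-in for the per-container usage call. `fake_container_usage` is hypothetical and only sleeps; issuing the calls concurrently is one common way to cut the total wait, not necessarily the fix the team chose:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_container_usage(handle, delay=0.01):
    """Stand-in for an expensive per-container garden call (hypothetical)."""
    time.sleep(delay)
    return {"handle": handle, "disk_mb": len(handle)}


def usage_sequential(handles):
    # Total time is roughly len(handles) * delay.
    return [fake_container_usage(h) for h in handles]


def usage_concurrent(handles, workers=8):
    # Total time is roughly ceil(len(handles) / workers) * delay.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_container_usage, handles))
```

`pool.map` preserves input order, so both functions return the same list.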
Results
Problems can come from:
• our software
– throttled etcd client
– sequential calls to garden
• software we consume
– garden container creation
• “experiment apparatus” (tools and services):
– papertrail lossiness
– cicerone sloppiness
• language runtime
– Golang’s DNS behaviour
Results
Fixed what we could control, and now it’s all garden
Results
Okay, so far, that’s just been
[Figure: grid of experiments #1 to #4 at scales x 1, x 2, x 5 and x 10]
Results
Next, the timelines of pushing 1k app instances
Results
• for the fastest pushes
– dominated by red, blue, gold
– i.e. upload source & CC emit “start”, staging process,
upload droplet
• pushes get slower
– growth in green, light blue, fuchsia, teal
– i.e. schedule staging, create staging container,
schedule running, create running container
• main concern: why is scheduling slowing down?
Results
• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
– with chattiness turned on
– with chattiness turned off
• appeared to work better
• tried it on AWS
• no improvement
Results
• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
– auctioneer asks cells for state
– cell reps ask garden for usage
– garden gets container disk usage → bottleneck
Results
Garden stops sending disk usage stats, scheduling time disappears
Results
Let’s let things stew between
and
Results
Right after all app pushes, decent workload distribution
Results
… an hour later, something pretty bad happened
Results
• cells heartbeat their presence to etcd
• if ttl expires, converger reschedules LRPs
• cells may reappear after their workloads have
been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hope to catch in n=2 round
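The sequence above can be reproduced in a toy model. `PresenceRegistry` and `converge` are illustrative names, not Diego's code, and the "least-loaded present cell" placement rule is an assumption standing in for the real auction:

```python
class PresenceRegistry:
    """Toy stand-in for TTL'd presence keys in a store such as etcd."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._expires = {}  # cell id -> expiry time

    def heartbeat(self, cell, now):
        self._expires[cell] = now + self.ttl

    def present(self, now):
        return {c for c, t in self._expires.items() if t > now}


def converge(assignments, registry, now):
    """Move LRPs off cells whose presence expired; leave everything else alone."""
    alive = registry.present(now)
    if not alive:
        return dict(assignments)
    load = {c: 0 for c in alive}
    for cell in assignments.values():
        if cell in load:
            load[cell] += 1
    result = {}
    for lrp, cell in assignments.items():
        if cell not in alive:
            cell = min(sorted(load), key=load.get)  # least-loaded present cell
            load[cell] += 1
        result[lrp] = cell
    return result
```

Because `converge` only moves work away from missing cells, a cell that reappears after its TTL expired comes back empty and stays that way until new work arrives.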
Results
With the one lingering question about cell disappearance, on to n=2
[Figure: grid of experiments #1 to #4 at scales x 1, x 2, x 5 and x 10; x 1 marked ✓ for all four experiments, with one open question (?)]
Results
With 800 concurrent task reqs, found a garden container cleanup bug
Results
With 800-instance LRP, found API node scheduling requests serially
Results
• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we
pushed 2x the apps
– things worked correctly
– system was performant as a whole
– but individual components showed signs of scale
issues
Results
Our “bulk durations” doubled
Results
• nsync fetches state from CC and etcd to make
sure CC desired state is reflected in diego
• converger fetches desired and actual state
from etcd to make sure things are consistent
• route-emitter fetches state from etcd to keep
gorouter in sync
• bulk loop times doubled from n=1
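Each of these bulk loops has roughly the shape of a fetch-and-diff. A minimal sketch, assuming state is reduced to instance counts keyed by app id (the real components compare richer records):

```python
def reconcile(desired, actual):
    """Return (to_start, to_stop): instance counts to add or remove per app."""
    to_start = {}
    to_stop = {}
    for app, want in desired.items():
        have = actual.get(app, 0)
        if have < want:
            to_start[app] = want - have
        elif have > want:
            to_stop[app] = have - want
    # Anything running that is no longer desired should be stopped entirely.
    for app, have in actual.items():
        if app not in desired:
            to_stop[app] = have
    return to_start, to_stop
```

A loop like this does work proportional to the amount of state it fetches, so its duration is expected to grow as the number of apps grows.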
Results
… and this happened again
Results
– the etcd and consul story –
Results
Fast-forward to today
[Figure: grid of experiments #1 to #4 at scales x 1, x 2, x 5 and x 10; completed runs marked ✓, open questions marked ?]
Bottom Line
At the highest scale:
• 4000 concurrent tasks ✓
• 4000-instance LRP ✓
• 10k “real app” instances @ 100 instances/cell:
– etcd (ephemeral data store) ✓
– consul (service discovery) ? (… it’s a long story)
– receptor (Diego API) ? (bulk JSON)
– nsync (CC desired state sync) ? (because of receptor)
– route-emitter (gorouter sync) ? (because of receptor)
– garden (containerizer) ✓
– rep (garden actual state sync) ✓
– auctioneer (scheduler) ✓
Next Steps
• Security
– mutual SSL between all components
– encrypting data-at-rest
• Versioning
– handle breaking API changes gracefully
– production hardening
• Optimize data models
– hand-in-hand with versioning
– shrink payload for bulk reqs
– investigate faster encodings; protobufs > JSON
– initial experiments show 100x speedup
Updates on .NET Support
Updates on .NET Support
• what’s currently supported?
– ASP.NET MVC
– nothing too exotic
– most CF/Diego features, e.g. security groups
– Visual Studio plugin, similar to the Eclipse CF plugin for Java
• what are the limitations?
– some newer Diego features, e.g. SSH
– in α/β stage, dev-only
Updates on .NET Support
• what’s coming up?
– make it easier to deploy Windows cell
– more Visual Studio plugin features
– hardening testing/CI
• further down the line?
– remote debugging
– the “Spring experience”
Updates on .NET Support
• shout outs
– CenturyLink
– HP
• feedback & questions?
– Mark Kropf (PM): mkropf@pivotal.io
– David Morhovich (Lead): dmorhovich@pivotal.io

More Related Content

What's hot

V mware v realize orchestrator 6.0 knowledge transfer kit
V mware v realize orchestrator 6.0 knowledge transfer kitV mware v realize orchestrator 6.0 knowledge transfer kit
V mware v realize orchestrator 6.0 knowledge transfer kitsolarisyougood
 
Cloudfoundry Introduction
Cloudfoundry IntroductionCloudfoundry Introduction
Cloudfoundry IntroductionYitao Jiang
 
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...Lucas Jellema
 
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Chocolatey Software
 
Why Your Digital Transformation Strategy Demands Middleware Modernization
Why Your Digital Transformation Strategy Demands Middleware ModernizationWhy Your Digital Transformation Strategy Demands Middleware Modernization
Why Your Digital Transformation Strategy Demands Middleware ModernizationVMware Tanzu
 
Platform as a Service (PaaS) - A cloud service for Developers
Platform as a Service (PaaS) - A cloud service for Developers Platform as a Service (PaaS) - A cloud service for Developers
Platform as a Service (PaaS) - A cloud service for Developers Ravindra Dastikop
 
CF SUMMIT: Partnerships, Business and Cloud Foundry
CF SUMMIT: Partnerships, Business and Cloud FoundryCF SUMMIT: Partnerships, Business and Cloud Foundry
CF SUMMIT: Partnerships, Business and Cloud FoundryNima Badiey
 
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...Arraya Solutions
 
Ensuring Cloud Native Success: Organization Transformation
Ensuring Cloud Native Success:  Organization TransformationEnsuring Cloud Native Success:  Organization Transformation
Ensuring Cloud Native Success: Organization TransformationChloe Jackson
 
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014Amazon Web Services
 
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)VMware Tanzu
 
Technology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data CaptureTechnology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data CaptureAndrew Schofield
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise Jesus Rodriguez
 
Kubernetes: Dive into the Future of Infrastructure
Kubernetes: Dive into the Future of InfrastructureKubernetes: Dive into the Future of Infrastructure
Kubernetes: Dive into the Future of InfrastructureGlobalLogic Ukraine
 
Deploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud FoundryDeploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud Foundrycornelia davis
 
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...Cloud Native Day Tel Aviv
 
Pivotal Cloud Foundry 2.4: A First Look
Pivotal Cloud Foundry 2.4: A First LookPivotal Cloud Foundry 2.4: A First Look
Pivotal Cloud Foundry 2.4: A First LookVMware Tanzu
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...confluent
 

What's hot (20)

V mware v realize orchestrator 6.0 knowledge transfer kit
V mware v realize orchestrator 6.0 knowledge transfer kitV mware v realize orchestrator 6.0 knowledge transfer kit
V mware v realize orchestrator 6.0 knowledge transfer kit
 
Cloudfoundry Introduction
Cloudfoundry IntroductionCloudfoundry Introduction
Cloudfoundry Introduction
 
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...
Dutch Oracle Architects Platform - Reviewing Oracle OpenWorld 2017 and New Tr...
 
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
 
Why Your Digital Transformation Strategy Demands Middleware Modernization
Why Your Digital Transformation Strategy Demands Middleware ModernizationWhy Your Digital Transformation Strategy Demands Middleware Modernization
Why Your Digital Transformation Strategy Demands Middleware Modernization
 
Platform as a Service (PaaS) - A cloud service for Developers
Platform as a Service (PaaS) - A cloud service for Developers Platform as a Service (PaaS) - A cloud service for Developers
Platform as a Service (PaaS) - A cloud service for Developers
 
CF SUMMIT: Partnerships, Business and Cloud Foundry
CF SUMMIT: Partnerships, Business and Cloud FoundryCF SUMMIT: Partnerships, Business and Cloud Foundry
CF SUMMIT: Partnerships, Business and Cloud Foundry
 
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...
Gain Insights, Make Decisions, and Take Action Across a Streamlined and Autom...
 
What is Serverless Computing?
What is Serverless Computing?What is Serverless Computing?
What is Serverless Computing?
 
Ensuring Cloud Native Success: Organization Transformation
Ensuring Cloud Native Success:  Organization TransformationEnsuring Cloud Native Success:  Organization Transformation
Ensuring Cloud Native Success: Organization Transformation
 
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
 
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)
Part 4: Custom Buildpacks and Data Services (Pivotal Cloud Platform Roadshow)
 
Technology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data CaptureTechnology choices for Apache Kafka and Change Data Capture
Technology choices for Apache Kafka and Change Data Capture
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise
 
Kubernetes: Dive into the Future of Infrastructure
Kubernetes: Dive into the Future of InfrastructureKubernetes: Dive into the Future of Infrastructure
Kubernetes: Dive into the Future of Infrastructure
 
Deploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud FoundryDeploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud Foundry
 
Cache-Aside Cloud Design Pattern
Cache-Aside Cloud Design PatternCache-Aside Cloud Design Pattern
Cache-Aside Cloud Design Pattern
 
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...
12 Factor Serverless Applications - Mike Morain, AWS - Cloud Native Day Tel A...
 
Pivotal Cloud Foundry 2.4: A First Look
Pivotal Cloud Foundry 2.4: A First LookPivotal Cloud Foundry 2.4: A First Look
Pivotal Cloud Foundry 2.4: A First Look
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with K...
 

Viewers also liked

Cloud Foundryは何故動くのか
Cloud Foundryは何故動くのかCloud Foundryは何故動くのか
Cloud Foundryは何故動くのかKazuto Kusama
 
Introduction into Cloud Foundry and Bosh | anynines
Introduction into Cloud Foundry and Bosh | anyninesIntroduction into Cloud Foundry and Bosh | anynines
Introduction into Cloud Foundry and Bosh | anyninesanynines GmbH
 
Bluemix and DevOps workshop lab
Bluemix and DevOps workshop labBluemix and DevOps workshop lab
Bluemix and DevOps workshop labbenm4nn
 
Cloud Foundry for PHP developers
Cloud Foundry for PHP developersCloud Foundry for PHP developers
Cloud Foundry for PHP developersDaniel Krook
 
今すぐ始めるCloud Foundry #hackt #hackt_k
今すぐ始めるCloud Foundry #hackt #hackt_k今すぐ始めるCloud Foundry #hackt #hackt_k
今すぐ始めるCloud Foundry #hackt #hackt_kToshiaki Maki
 

Viewers also liked (6)

Cloud Foundryは何故動くのか
Cloud Foundryは何故動くのかCloud Foundryは何故動くのか
Cloud Foundryは何故動くのか
 
Introduction into Cloud Foundry and Bosh | anynines
Introduction into Cloud Foundry and Bosh | anyninesIntroduction into Cloud Foundry and Bosh | anynines
Introduction into Cloud Foundry and Bosh | anynines
 
Bluemix and DevOps workshop lab
Bluemix and DevOps workshop labBluemix and DevOps workshop lab
Bluemix and DevOps workshop lab
 
Cloud Foundry for PHP developers
Cloud Foundry for PHP developersCloud Foundry for PHP developers
Cloud Foundry for PHP developers
 
GO-CFを試してみる
GO-CFを試してみるGO-CFを試してみる
GO-CFを試してみる
 
今すぐ始めるCloud Foundry #hackt #hackt_k
今すぐ始めるCloud Foundry #hackt #hackt_k今すぐ始めるCloud Foundry #hackt #hackt_k
今すぐ始めるCloud Foundry #hackt #hackt_k
 

Similar to How does the Cloud Foundry Diego Project Run at Scale?

Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at ParseTravis Redman
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at ParseMongoDB
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Lucidworks
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NETDavid Giard
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging振东 刘
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profitRodrigo Campos
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkKarthik Deivasigamani
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)KafkaZone
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Jon Haddad
 

Similar to How does the Cloud Foundry Diego Project Run at Scale? (20)

Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Benchmarking at Parse
Benchmarking at ParseBenchmarking at Parse
Benchmarking at Parse
 
Advanced Benchmarking at Parse
Advanced Benchmarking at ParseAdvanced Benchmarking at Parse
Advanced Benchmarking at Parse
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NET
 
3.2 Streaming and Messaging
3.2 Streaming and Messaging3.2 Streaming and Messaging
3.2 Streaming and Messaging
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Advanced Operations
Advanced OperationsAdvanced Operations
Advanced Operations
 
Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)Diagnosing Problems in Production (Nov 2015)
Diagnosing Problems in Production (Nov 2015)
 

More from VMware Tanzu

What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItVMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleVMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductVMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready AppsVMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And BeyondVMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptxVMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchVMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishVMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - FrenchVMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootVMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerVMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeVMware Tanzu
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsVMware Tanzu
 

How does the Cloud Foundry Diego Project Run at Scale?

  • 1. How does the Cloud Foundry Diego Project Run at Scale? and updates on .NET Support
  • 2. Who’s this guy? • Amit Gupta • https://akgupta.ca • @amitkgupta84
  • 3. Who’s this guy? • Berkeley math grad school… dropout • Rails consulting… deserter • now I do BOSH, Cloud Foundry, Diego, etc.
  • 4. Testing Diego Performance at Scale • current Diego architecture • performance testing approach • test specifications • test implementation and tools • results • bottom line • next steps
  • 6. Current Diego Architecture What’s new-ish? • consul for service discovery • receptor (API) to decouple from CC • SSH proxy for container access • NATS-less auction • garden-windows for .NET applications
  • 7. Current Diego Architecture Main components: • etcd ephemeral data store • consul service discovery • receptor Diego API • nsync sync CC desired state w/Diego • route-emitter sync with gorouter • converger health mgmt & consistency • garden containerization • rep sync garden actual state w/Diego • auctioneer workload scheduling
  • 8. Performance Testing Approach • full end-to-end tests • do a lot of stuff: – is it correct, is it performant? • kill a lot of stuff: – is it correct, is it performant? • emit logs and metrics (business as usual) • plot & visualize • fix stuff, repeat at higher scale*
  • 10. Test Specifications [figure: experiments #1, #2, #3, #4, each run at scale n = 1, 2, 5, 10]
  • 11. Test Specifications • Diego does tasks and long-running processes • launch 10n, …, 400n tasks: – workload distribution? – scheduling time distribution? – running time distribution? – success rate? – growth rate? • launch 10n, …, 400n-instance LRP: – same questions…
  • 12. Test Specifications • Diego+CF stages and runs apps • > cf push • upload source bits • fetch buildpack and stage droplet (task) • fetch droplet and run app (LRP) • dynamic routing • streaming logs
  • 13. Test Specifications • bring up n nodes in parallel – from each node, push a apps in parallel – from each node, repeat this for r rounds • a is always ≈ 20 • r is always = 40 • n starts out = 1
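A rough sketch of the push volume this implies, assuming the total is simply n × a × r. The function name is hypothetical, and the list of n values is an assumption taken from the x1/x2/x5/x10 grid on the earlier slide.

```python
def total_pushes(n: int, a: int = 20, r: int = 40) -> int:
    """Number of `cf push` operations: n nodes, a apps per node per round, r rounds."""
    return n * a * r


# Assumed scale steps; the slides only state that n starts at 1.
for n in (1, 2, 5, 10):
    print(f"n={n}: {total_pushes(n)} pushes")
```

Since a is only approximately 20, these figures are estimates rather than exact counts.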
  • 14. Test Specifications • the pushed apps have varying characteristics: – 1-4 instances – 128M-1024M memory – 1M-200M source code payload – 1-20 log lines/second – crash never vs. every 30 s
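One way such a mix of apps could be generated is sketched below. The `AppSpec` type, the uniform distributions, the discrete memory sizes, and the 10% crash fraction are all assumptions for illustration; the slides give only the ranges.

```python
import random
from dataclasses import dataclass


@dataclass
class AppSpec:
    instances: int
    memory_mb: int
    payload_mb: int
    log_lines_per_sec: int
    crashes: bool  # True: crash every 30 s; False: never crash


def random_app(rng: random.Random, crash_fraction: float = 0.1) -> AppSpec:
    """Draw one app spec from the ranges listed on the slide (uniformly, by assumption)."""
    return AppSpec(
        instances=rng.randint(1, 4),
        memory_mb=rng.choice([128, 256, 512, 1024]),
        payload_mb=rng.randint(1, 200),
        log_lines_per_sec=rng.randint(1, 20),
        crashes=rng.random() < crash_fraction,
    )


rng = random.Random(42)  # fixed seed so a test round is reproducible
apps = [random_app(rng) for _ in range(20)]
```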
  • 15. Test Specifications • starting with n=1: – app instances ≈ 1k – instances/cell ≈ 100 – memory utilization across cells ≈ 90% – app instances crashing (by-design) ≈ 10%
  • 16. Test Specifications • evaluate: – workload distribution – success rate of pushes – success rate of app routability – times for all the things in the push lifecycles – crash recovery behaviour – all the metrics!
  • 17. Test Specifications • kill 10% of cells – watch metrics for recovery behaviour • kill moar cells… and etcd – does system handle excess load gracefully? • revive everything with > bosh cck – does system recover gracefully… – with no further manual intervention?
  • 18. Test Specifications – Figure Out What’s Broke – – Fix Stuff – – Move On Scale Up & Repeat –
  • 19. Test Implementation and Tools • S3 log, graph, plot backups • ginkgo & gomega testing DSL • BOSH parallel test-lab deploys • tmux & ssh run test suites remotely • papertrail log archives • datadog metrics visualizations • cicerone (custom) log visualizations
  • 20. Results 400 tasks’ lifecycle timelines, dominated by container creation
  • 21. Results Maybe some cells’ gardens were running slower?
  • 22. Results Grouping by cell shows uniform container creation slowdown
  • 23. Results So that’s not it… Also, what’s with the blue steps? Let’s visualize logs a couple more ways Then take stock of the questions raised
  • 24. Results Let’s just look at scheduling (ignore container creation, etc.)
  • 25. Results Scheduling again, grouped by which API node handled the request
  • 26. Results And how about some histograms of all the things?
  • 27. Results From the 400-task request from “Fezzik”: • only 3-4 (out of 10) API nodes handle reqs? • recording task reqs take increasing time? • submitting auction reqs sometimes slow? • later auctions take so long? • outliers wtf? • container creation takes increasing time?
  • 28. Results • only 3-4 (out of 10) API nodes handle reqs? – when multiple address requests are in flight during a DNS lookup, Golang returns the same DNS response to all of them; this results in only 3-4 API endpoint lookups for the whole set of tasks • recording task reqs take increasing time? – API servers use an etcd client with throttling on # of concurrent requests • submitting auction reqs sometimes slow? – auction requests require API node to look up auctioneer address in etcd, using throttled etcd client
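To see why a concurrency-throttled client makes later requests look slower, here is a toy queueing model, not the actual etcd client. It assumes all requests arrive at once, every request takes the same time, and at most `limit` run concurrently; the numbers are hypothetical.

```python
def completion_times_ms(count: int, limit: int, service_ms: int) -> list[int]:
    """Completion time of each request when all arrive at t=0 and at most
    `limit` of them are allowed to run concurrently."""
    return [(i // limit + 1) * service_ms for i in range(count)]


times = completion_times_ms(count=400, limit=10, service_ms=10)
print(times[0], times[-1])  # 10 400
```

Under these assumptions the last of 400 requests finishes 40 times later than the first, which matches the shape of "recording task reqs take increasing time" even though no single request got slower.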
  • 29. Results • later auctions take so long? – reps were taking longer to report their state to auctioneer, because they were making expensive calls to garden, sequentially, to determine current resource usage • outliers wtf? – combination of missing logs due to papertrail lossiness, + cicerone handling missing data poorly • container creation takes increasing time? – garden team tasked with investigation
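The cost of querying garden sequentially can be illustrated with a small sketch. `fake_usage` stands in for an expensive per-container call and is entirely hypothetical; issuing the calls concurrently is shown as one possible mitigation, not necessarily the fix the team applied.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_usage(container_id: int) -> dict:
    """Stand-in for an expensive per-container usage query (hypothetical)."""
    time.sleep(0.01)
    return {"id": container_id, "memory_mb": 128}


def usage_sequential(ids):
    # One call at a time: total time grows linearly with the container count.
    return [fake_usage(i) for i in ids]


def usage_concurrent(ids, workers: int = 8):
    # Same calls, issued through a thread pool; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_usage, ids))
```

With 100 containers per cell, the sequential version pays the per-call latency 100 times on every state report, which is why it shows up in auction times.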
  • 30. Results Problems can come from: • our software – throttled etcd client – sequential calls to garden • software we consume – garden container creation • “experiment apparatus” (tools and services): – papertrail lossiness – cicerone sloppiness • language runtime – Golang’s DNS behaviour
  • 31. Results Fixed what we could control, and now it’s all garden
  • 32. Results Okay, so far, that’s just been #1: #2: #3: #4: x 1 #1: #2: #3: #4: x 2 #1: #2: #3: #4: x 5 #1: #2: #3: #4: x 10
  • 33. Results Next, the timelines of pushing 1k app instances
  • 34. Results • for the fastest pushes – dominated by red, blue, gold – i.e. upload source & CC emit “start”, staging process, upload droplet • pushes get slower – growth in green, light blue, fuchsia, teal – i.e. schedule staging, create staging container, schedule running, create running container • main concern: why is scheduling slowing down?
  • 35. Results • we had a theory (blame app log chattiness) • reproduced experiment in BOSH-Lite – with chattiness turned on – with chattiness turned off • appeared to work better • tried it on AWS • no improvement ☹
  • 36. Results • spelunked through more logs • SSH’d onto nodes and tried hitting services • eventually pinpointed it: – auctioneer asks cells for state – cell reps ask garden for usage – garden gets container disk usage → bottleneck
  • 37. Results Garden stops sending disk usage stats, scheduling time disappears
  • 38. Results Let’s let things stew between and
  • 39. Results Right after all app pushes, decent workload distribution
  • 40. Results … an hour later, something pretty bad happened
  • 41. Results • cells heartbeat their presence to etcd • if ttl expires, converger reschedules LRPs • cells may reappear after their workloads have been reassigned • they remain underutilized • but why do cells disappear in the first place? • added more logging, hope to catch in n=2 round
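The heartbeat behaviour described above can be sketched as follows. This is a simplified model: the function names are hypothetical, and the round-robin reassignment is an assumption standing in for Diego's real auction-based scheduling.

```python
def expired_cells(last_heartbeat: dict, now: float, ttl: float) -> list:
    """Cells whose most recent heartbeat is older than the TTL."""
    return sorted(c for c, t in last_heartbeat.items() if now - t > ttl)


def reschedule(placements: dict, missing: list) -> dict:
    """Move LRPs from missing cells onto live cells (round-robin, by assumption)."""
    live = sorted(c for c in placements if c not in missing)
    if not live:
        raise RuntimeError("no live cells to reschedule onto")
    result = {c: list(placements[c]) for c in live}
    orphans = [lrp for c in sorted(missing) for lrp in placements.get(c, [])]
    for i, lrp in enumerate(orphans):
        result[live[i % len(live)]].append(lrp)
    return result


heartbeats = {"cell-1": 100.0, "cell-2": 70.0, "cell-3": 99.0}
placements = {"cell-1": ["a"], "cell-2": ["b", "c"], "cell-3": ["d"]}

missing = expired_cells(heartbeats, now=100.0, ttl=20.0)
after = reschedule(placements, missing)
after["cell-2"] = []  # the cell reappears later, but its old work is gone
```

In this model the returning cell stays empty until new work arrives, which matches the underutilization seen an hour after the pushes.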
  • 42. Results With the one lingering question about cell disappearance, on to n=2 #1: #2: #3: #4: x 1 #1: #2: #3: #4: x 2 #1: #2: #3: #4: x 5 #1: #2: #3: #4: x 10 ✓✓ ✓ ✓ ?
  • 43. Results With 800 concurrent task reqs, found container cleanup garden bug
  • 44. Results With 800-instance LRP, found API node request scheduling serially
  • 45. Results • we added a story to the garden backlog • the serial request issue was an easy fix • then, with n=2 parallel test-lab nodes, we pushed 2x the apps – things worked correctly – system was performant as a whole – but individual components showed signs of scale issues
  • 47. Results • nsync fetches state from CC and etcd to make sure CC desired state is reflected in diego • converger fetches desired and actual state from etcd to make sure things are consistent • route-emitter fetches state from etcd to keep gorouter in sync • bulk loop times doubled from n=1
  • 48. Results … and this happened again
  • 49. Results – the etcd and consul story –
  • 50. Results Fast-forward to today #1: #2: #3: #4: x 1 #1: #2: #3: #4: x 2 #1: #2: #3: #4: x 5 #1: #2: #3: #4: x 10 ✓✓ ✓ ✓ ? ✓✓ ✓ ✓ ? ✓✓ ✓ ✓ ? ✓ ???
  • 51. Bottom Line At the highest scale: • 4000 concurrent tasks ✓ • 4000-instance LRP ✓ • 10k “real app” instances @ 100 instances/cell: – etcd (ephemeral data store) ✓ – consul (service discovery) ? (… it’s a long story) – receptor (Diego API) ? (bulk JSON) – nsync (CC desired state sync) ? (because of receptor) – route-emitter (gorouter sync) ? (because of receptor) – garden (containerizer) ✓ – rep (garden actual state sync) ✓ – auctioneer (scheduler) ✓
  • 52. Next Steps • Security – mutual SSL between all components – encrypting data-at-rest • Versioning – handle breaking API changes gracefully – production hardening • Optimize data models – hand-in-hand with versioning – shrink payload for bulk reqs – investigate faster encodings; protobufs > JSON – initial experiments show 100x speedup
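To give a feel for why a binary encoding can shrink bulk payloads, here is a sketch comparing JSON with fixed-width packing via Python's `struct` module. This is not protobuf, and the record fields are hypothetical; it only illustrates that repeating field names as text in every record costs space, and it makes no claim about the speedup mentioned above.

```python
import json
import struct

# Hypothetical bulk response: 1000 small records with three integer fields.
records = [
    {"instance_index": i, "memory_mb": 256, "disk_mb": 1024} for i in range(1000)
]

# JSON repeats every field name as text in every record.
json_bytes = json.dumps(records).encode("utf-8")

# Fixed-width binary: three little-endian unsigned 32-bit ints per record.
RECORD = struct.Struct("<III")
packed = b"".join(
    RECORD.pack(r["instance_index"], r["memory_mb"], r["disk_mb"]) for r in records
)

# Decode again to confirm no information was lost.
decoded = [
    dict(zip(("instance_index", "memory_mb", "disk_mb"), fields))
    for fields in RECORD.iter_unpack(packed)
]

print(len(json_bytes), len(packed))
```

The trade-off is that a binary format needs a schema shared by both sides, which is why the slide ties data-model work to versioning.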
  • 53. Updates on .NET Support
  • 54. Updates on .NET Support • what’s currently supported? – ASP.NET MVC – nothing too exotic – most CF/Diego features, e.g. security groups – VisualStudio plugin, similar to the Eclipse CF plugin for Java • what are the limitations? – some newer Diego features, e.g. SSH – in α/β stage, dev-only
  • 55. Updates on .NET Support • what’s coming up? – make it easier to deploy Windows cell – more VisualStudio plugin features – hardening testing/CI • further down the line? – remote debugging – the “Spring experience”
  • 56. Updates on .NET Support • shout outs – CenturyLink – HP • feedback & questions? – Mark Kropf (PM): mkropf@pivotal.io – David Morhovich (Lead): dmorhovich@pivotal.io