6. pivotal.io/roadshow #cnr
Jamie Dimon, CEO JPMC
Source: JPMC Annual Shareholder Letter (2015)
“Silicon Valley is coming… and they want to eat
our lunch.”
42. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Ben Treynor, Founder of Google’s Site Reliability Team
“Site Reliability Engineering is what happens
when you ask a software engineer to design an
operations function.”
43. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Dave Rensin, Director of Google Customer Reliability Engineering
“Customer Reliability Engineering’s mission is to
create a shared operational fate between Google
and our Google Cloud Platform customers.”
53. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Pivotal Cloud Foundry Elastic Runtime
Pivotal Cloud Foundry Operations Manager
Spring Boot and Spring Cloud Services
BOSH Release
54. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Pivotal Cloud Foundry Elastic Runtime
Pivotal Cloud Foundry Operations Manager
Spring Boot and Spring Cloud Services
BOSH Release
12 Factor
56. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Pivotal Cloud Foundry Elastic Runtime
Pivotal Cloud Foundry Operations Manager
Spring Boot and Spring Cloud Services
Cloud Provider Interface (CPI)
BOSH Release
12 Factor
62. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Pivotal Cloud Foundry Elastic Runtime
Pivotal Cloud Foundry Operations Manager
Spring Boot and Spring Cloud Services
Cloud Provider Interface (CPI)
BOSH Release
12 Factor
Programmable compute, storage & networking
64. pivotal.io/roadshow #cnr
Architecture Process Culture Platform
Pivotal Cloud Foundry Elastic Runtime
Pivotal Cloud Foundry Operations Manager
Spring Boot and Spring Cloud Services
Cloud Provider Interface (CPI)
BOSH Release
12 Factor
Programmable Infrastructure
72. pivotal.io/roadshow #cnr
Verma et al., “Large-scale cluster management at Google with Borg”
“Almost every task run under Borg contains a
built-in HTTP server that publishes information
about the health of the task and thousands of
performance metrics (e.g., RPC latencies).”
Observability
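As a rough sketch of the pattern the quote describes (not Borg's actual implementation), a task can embed a small HTTP server that publishes its health and a few metrics. The paths `/healthz` and `/metrics`, the metric names, and the helpers `render` and `start_server` are assumptions made up for this illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metrics; a real task would update these as it works.
METRICS = {"requests_total": 0, "rpc_latency_ms_p99": 0.0}


def render(path):
    """Return (status_code, json_body) for a path. Paths are illustrative."""
    if path == "/healthz":
        return 200, json.dumps({"status": "ok"})
    if path == "/metrics":
        return 200, json.dumps(METRICS)
    return 404, json.dumps({"error": "not found"})


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = render(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Serve on a background thread; port 0 asks the OS for a free port."""
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    print(render("/healthz"))
    print(render("/metrics"))
```

Calling `start_server(8080)` inside a long-running task would expose the same responses over HTTP on localhost, so a monitoring system could poll them; call `shutdown()` on the returned server to stop it.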
77. pivotal.io/roadshow #cnr
If a system should be 99.99% available, then it
can be 0.01% unavailable.
If we have error budget left, development can take
risks. If not, we have to fix reliability first.
SLAs – Error Budgets
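To make the arithmetic concrete, here is a minimal sketch that converts an availability target into minutes of permitted downtime and checks whether any budget remains. The function names and the 30-day window are assumptions chosen for illustration, not part of the talk.

```python
def allowed_downtime_minutes(availability_target, period_days):
    """Minutes of downtime an availability target permits over a period."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def budget_remaining_minutes(availability_target, period_days, downtime_minutes):
    """Error budget left after subtracting the downtime already incurred."""
    budget = allowed_downtime_minutes(availability_target, period_days)
    return budget - downtime_minutes


def can_take_risks(availability_target, period_days, downtime_minutes):
    """True while some error budget is left for the period."""
    remaining = budget_remaining_minutes(
        availability_target, period_days, downtime_minutes
    )
    return remaining > 0


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.9999, 30)
    print("99.99%% over 30 days allows %.2f minutes of downtime" % budget)
    print("risky launch OK after 1 minute down:", can_take_risks(0.9999, 30, 1.0))
    print("risky launch OK after 5 minutes down:", can_take_risks(0.9999, 30, 5.0))
```

At 99.99% over 30 days the budget works out to roughly 4.32 minutes, so a single five-minute outage would spend it all.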
79. pivotal.io/roadshow #cnr
Service Level Objective: 99.99% of requests return
under 50ms.
The error budget allows 0.01% of requests to
take 50ms or longer.
Error Budgets – Latency
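The same idea can be checked against measured latencies. This sketch counts requests that miss the 50ms threshold and compares them with the number the budget allows. The function name `latency_slo_report` and the report fields are hypothetical; `Fraction` is used so that 0.01% of 10,000 requests comes out as exactly 1 rather than a floating-point near miss.

```python
from fractions import Fraction


def latency_slo_report(latencies_ms, threshold_ms=50.0, target=0.9999):
    """Summarize how a batch of request latencies stands against a latency SLO."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    total = len(latencies_ms)
    # "Under 50ms" means a request at exactly the threshold counts as slow.
    slow = sum(1 for ms in latencies_ms if ms >= threshold_ms)
    # Exact arithmetic: the number of slow requests the budget permits.
    allowed_slow = (1 - Fraction(str(target))) * total
    return {
        "total": total,
        "slow": slow,
        "allowed_slow": float(allowed_slow),
        "slo_met": slow <= allowed_slow,
        "budget_used": float(Fraction(slow) / allowed_slow),
    }


if __name__ == "__main__":
    samples = [10.0] * 9999 + [60.0]  # one slow request out of 10,000
    print(latency_slo_report(samples))
```

With one slow request in 10,000 the SLO is still met but the budget is fully spent; a second slow request in the same window would violate it.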
87. pivotal.io/roadshow #cnr
Service Reliability Hierarchy
Monitoring
Incident Response
Postmortem / Root Cause Analysis
Testing / Release Procedure
Capacity Planning
Development
Product
89. pivotal.io/roadshow #cnr
Susan J. Fowler, “Production-Ready Microservices”
“Every µService at Uber should be stable, reliable,
scalable, fault tolerant, performant, monitored,
documented, and prepared for any catastrophe.”
90. pivotal.io/roadshow #cnr
A distributed system cannot simultaneously have
consistent views of the data at each node and
availability of the data at each node if the
network becomes partitioned.
The CAP Theorem
94. pivotal.io/roadshow #cnr
A distributed system cannot simultaneously have
consistent views of the data at each node and
availability of the data at each node if the
network becomes partitioned.
The CAP Theorem
Requests aren’t
being served!
95. pivotal.io/roadshow #cnr
A distributed system cannot simultaneously have
consistent views of the data at each node and
availability of the data at each node if the
network becomes partitioned.
The CAP Theorem
Requests aren’t
being served!
Unavailable!
97. pivotal.io/roadshow #cnr
A distributed system cannot simultaneously have
consistent views of the data at each node and
availability of the data at each node if the
network becomes partitioned.
The CAP Theorem
Serving requests
like normal!
98. pivotal.io/roadshow #cnr
A distributed system cannot simultaneously have
consistent views of the data at each node and
availability of the data at each node if the
network becomes partitioned.
The CAP Theorem
Serving requests
like normal!
Inconsistent!
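The two outcomes above can be reproduced with a toy model. This is a deliberately simplified sketch, not a real replication protocol: two in-memory replicas, a `partitioned` flag, and a `mode` that either refuses requests during a partition ("CP") or keeps serving them locally ("AP"). All class and method names are invented for illustration.

```python
class Unavailable(Exception):
    """Raised when the store refuses a request to stay consistent."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, key, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise Unavailable("cannot replicate during a partition")
            replica.data[key] = value  # Availability first: accept locally.
            return
        replica.data[key] = value
        peer.data[key] = value  # Healthy network: replicate to the peer.

    def read(self, replica, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value during a partition")
        return replica.data.get(key)

    def consistent(self):
        return self.a.data == self.b.data


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write(store.a, "x", 0)
        store.partitioned = True
        try:
            store.write(store.a, "x", 1)
            store.write(store.b, "x", 2)
            served = True
        except Unavailable:
            served = False
        print(mode, "served:", served, "consistent:", store.consistent())
```

In "CP" mode the partition makes the store unavailable but the replicas still agree; in "AP" mode both sides keep serving and end up holding different values for the same key.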
99. pivotal.io/roadshow #cnr
Raymond Blum and Rhandeev Singh, “Site Reliability Engineering”
“Data integrity is a function of availability of a
given entity over its lifetime. This is analogous to
system uptime and even more critical.”
100. pivotal.io/roadshow #cnr
Raymond Blum and Rhandeev Singh, “Site Reliability Engineering”
“Data availability must be a foremost concern of
any data-centric system.”
101. pivotal.io/roadshow #cnr
Raymond Blum and Rhandeev Singh, “Site Reliability Engineering”
“From the user’s point of view, data integrity
without expected and regular data availability is
effectively the same as having no data at all.”
108. pivotal.io/roadshow #cnr
Wikipedia Article “Operability”
“Operability is the ability to keep an equipment, a
system, or a whole industrial installation in a safe
and reliable functioning condition, according to
pre-defined operational requirements.”
What is operability?
110. pivotal.io/roadshow #cnr
The ability to deploy to production whenever the
organization chooses without anyone setting
themselves on fire.
Continuous Delivery
112. pivotal.io/roadshow #cnr
It doesn’t matter how beautiful your architecture
is, how easy deployment is, or how great your
culture is if production is a tire fire.
Pivotal Cloud Foundry
113. pivotal.io/roadshow #cnr
No CEO Ever
“I appreciate the progress you made on not
delivering anything.”
Undifferentiated Heavy Lifting
114. pivotal.io/roadshow #cnr
Unique Business Value is the set of tools, systems, and
processes that improve the unique value your
organization provides.
The only thing that matters
115. pivotal.io/roadshow #cnr
Acacio Cruz and Ashish Bhambhani, “Site Reliability Engineering”
“Provide product development with a platform of
SRE-validated infrastructure, upon which they
can build their systems. This platform will have
the double benefit of being both reliable and
scalable.”
117. pivotal.io/roadshow #cnr
Ben Treynor, Founder of Google’s Site Reliability Team
“The SRE Benediction:
May the Queries Flow,
And the Pagers Remain Silent”