SlideShare a Scribd company logo
1 of 41
Download to read offline
From Grief to Growth: The 7 Stages of
Observability
Alex Gervais
ConFoo Montreal
Feb. 26, 2020
Bonjour-Hi!
Outdoorsy, data-driven, eternal student, not so geeky creative
mind and traveler. Alex is a curious, introverted and humble
character. Working by day as a Senior Software Developer at
Datawire he has many years of savoir-faire building full-stack
systems from cloud infrastructures to backend services and
DevOps tools. Alex is a Kubernetes early adopter who thrives on
collaboration and contributor to many Cloud-Native projects.
@alex_gervais on twitter
alexgervais on github
Case Study
Grief
Confusion
Envy
Excitement
Satisfaction
Fear
Growth
Lessons learned
Table of Contents
Prediction time...
Serverless will take over the world.
The story
only the names have been changed to protect the innocent
Monoliths and Distributed Applications
my-app
web-frontend
the-relational-db
this-other-service
oh-that-third-party
big-data-warehouse
infosec-robot
some-cache
this-other-service
-v2
email-relay
In the Monolith era, we were equipped with
● Log files and tail, grep, and more
● curl, tcpdump, dig, and more
● IDE debuggers
● Some form of APM
● Classic monitoring such as Nagios, Pingdom, and more
Business-critical operations
Who’s monitoring the user experience and satisfaction?
Are we equipped to face traffic bursts?
How is our release process and QA culture impacting release
frequency AND quality?
Who owns each project and component?
Grief
deep sorrow, especially that caused by someone's death
This is soooo complicated...
● We have downtimes, failures, unstable environments and the
infrastructure footprint is increasing
● Everything comes up as a SURPRISE!
● Failures are everywhere: in infrastructure, in applications, in
configuration
● Failures are visible: in degraded performance, in bugs, in
downtime
● Fatigue
To delight our customers with our SaaS offering,
our problem was…
How could we reach operational excellence and
share ownership effectively?
Objectives
● Increase our global comprehension of the system
● Improve the platform stability
● Find owners for each component and involve them in
operational work
● State of the art observability
Confusion
lack of understanding; uncertainty
Blind operations
● We have all these metrics and data sources, but we are still
operating in the dark
● Too many tools, no trustable source of truth
● No runbooks
● Clients report errors, slowness and downtimes before we
catch them… That’s not good!
● An easy prey for vendors
Concepts
Service-Level Indicator - SLI
Service-Level Objective - SLO
Service-Level Agreement - SLA
Error budgets (P99)
Distinguish the types of error:
bugs vs performance; app vs infra vs config
Envy
a feeling of discontented or resentful longing aroused by someone else's possessions,
qualities, or luck
Keep on dreaming
● Case studies, blogs, and vendor pitches
● Conferences and learning from mature organizations
● Is this a miracle?
● Chaos engineering
● No free cycles to improve
Toil
Manual, repetitive, automatable, interrupt-driven and reactive
work with no enduring value.
Unplanned Work
Work In Progress
Adopting a vision
● Business sponsors
● Invest in our platform
● Identified key enablers to reduce toil
Excitement
a feeling of great enthusiasm and eagerness
We set out to improve our tooling
● Introduce new bleeding-edge technologies
● Custom tools
● Many integrations and trials
Satisfaction
fulfillment of one's wishes, expectations, or needs, or the pleasure derived from this
Good is the enemy of great
● We did a lot of custom code and integrations
● We need to define internal standards
● We tried pretty much everything, all tools and solutions had
their chance
● Definition of Done?
Increase in the number of tools...
● Too many tools and solutions
● We are not reducing complexity
● We now need operational knowledge of more tools!
Friction
● (Intrusive) code instrumentation multiple stacks
● Bumping heads with competing standards
● Change of mentality
● Change of habits
Fear
an unpleasant emotion caused by the belief that someone or something is dangerous,
likely to cause pain, or a threat
Living in fear
● No adoption
● Are we covered?
● How’s ownership going?
Unknown unknowns
● Practice… practice… incident management & chaos
engineering
● Distributed tracing data had a lot of potential, but we failed at
extracting any value out of the data we collected
Grief Iterate
Double down on distributed tracing
● Vendor partnership
● Instrument our code
● Attach business metrics and logs to spans
● Collect the data
● Infer metrics from the collected data
● Render in a comprehensive single pane of glass
● No data retention; No sampling
Live service graph
Single trace view
Drill down analytics
Distributed traces
Helped define our SLIs
Visibility on our SLOs
Identify root-causes
Understand failure modes
Go-to dashboard
Trace Context
Growth
1. the process of developing or maturing physically, mentally, or spiritually
2. the process of increasing in amount, value, or importance
Addressing the problem
● Find, assign and route alert to owners
● Reduce the number of moving parts (environments)
● Start at the edge
● “3 pillars”
● Aligned with business objectives
● The power of habits
Thanks!
Credits
[1] The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win,
Gene Kim, Kevin Behr, George Spafford, 2013
[2] https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
[3] https://www.datadoghq.com/blog/monitoring-101-alerting/
[4] https://xkcd.com/2259/
[5] https://landing.google.com/sre/sre-book/chapters/eliminating-toil/
[6]
https://imgix.datadoghq.com/img/blog/net-monitoring-apm/dotnet-monitoring-service-map
v3.png?fit=max
[7] https://vital-leaf.cloudvent.net/paths/gs-resolve-path/step-three
[8] https://github.com/honeycombio/examples/tree/master/kubernetes-envoy-tracing

More Related Content

Similar to [Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais

DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck VictorOps
 
HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software jamieayre
 
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0Amazon Web Services
 
Rsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AIRsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AISanjana Chowdhury
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellIT Arena
 
TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?Haggai Philip Zagury
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Productioniguazio
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humansconjur_inc
 
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...Splunk
 
SplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to SplunkSplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to SplunkSplunk
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureYshay Yaacobi
 
Making Onboarding Suck Less
Making Onboarding Suck LessMaking Onboarding Suck Less
Making Onboarding Suck LessVMware Tanzu
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesRob Winters
 
Agile Software and DevOps Essentials
Agile Software and DevOps EssentialsAgile Software and DevOps Essentials
Agile Software and DevOps EssentialsNarayanan Subramaniam
 
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...Splunk
 
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...Splunk
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineSagi Brody
 

Similar to [Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais (20)

DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck
 
HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software HIS 2017 Paul Sherwood- towards trustable software
HIS 2017 Paul Sherwood- towards trustable software
 
Security for Humans
Security for HumansSecurity for Humans
Security for Humans
 
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0(SEC402) Enterprise Cloud Security via DevSecOps 2.0
(SEC402) Enterprise Cloud Security via DevSecOps 2.0
 
Rsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AIRsqrd AI: From R&D to ROI of AI
Rsqrd AI: From R&D to ROI of AI
 
DevSecOps What Why and How
DevSecOps What Why and HowDevSecOps What Why and How
DevSecOps What Why and How
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
 
TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?TechRadarCon 2022 | Have you built your platform yet ?
TechRadarCon 2022 | Have you built your platform yet ?
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
Security For Humans
Security For HumansSecurity For Humans
Security For Humans
 
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
SplunkLive! Frankfurt 2018 - Legacy SIEM to Splunk, How to Conquer Migration ...
 
SplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to SplunkSplunkLive! Paris 2018: Legacy SIEM to Splunk
SplunkLive! Paris 2018: Legacy SIEM to Splunk
 
Not my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructureNot my problem - Delegating responsibility to infrastructure
Not my problem - Delegating responsibility to infrastructure
 
Real world dev ops
Real world dev opsReal world dev ops
Real world dev ops
 
Making Onboarding Suck Less
Making Onboarding Suck LessMaking Onboarding Suck Less
Making Onboarding Suck Less
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Agile Software and DevOps Essentials
Agile Software and DevOps EssentialsAgile Software and DevOps Essentials
Agile Software and DevOps Essentials
 
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
Splunk Discovery: Warsaw 2018 - Legacy SIEM to Splunk, How to Conquer Migrati...
 
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
SplunkLive! Zurich 2018: Legacy SIEM to Splunk, How to Conquer Migration and ...
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
 

More from Ambassador Labs

Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...Ambassador Labs
 
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...Ambassador Labs
 
Cloud native development without the toil
Cloud native development without the toilCloud native development without the toil
Cloud native development without the toilAmbassador Labs
 
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services Ambassador Labs
 
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex GervaisAmbassador Labs
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard LiAmbassador Labs
 
What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0? What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0? Ambassador Labs
 
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes Ambassador Labs
 
Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy Ambassador Labs
 
Telepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for KubernetesTelepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for KubernetesAmbassador Labs
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...Ambassador Labs
 
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...Ambassador Labs
 
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...Ambassador Labs
 
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYCThe Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYCAmbassador Labs
 
Ambassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API GatewayAmbassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API GatewayAmbassador Labs
 
Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh? Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh? Ambassador Labs
 
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)Ambassador Labs
 
Webinar: Code Faster on Kubernetes
Webinar: Code Faster on KubernetesWebinar: Code Faster on Kubernetes
Webinar: Code Faster on KubernetesAmbassador Labs
 
QCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented DevelopmentQCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented DevelopmentAmbassador Labs
 
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...Ambassador Labs
 

More from Ambassador Labs (20)

Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
Building Microservice Systems Without Cooking Your Laptop: Going “Remocal” wi...
 
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
Ambassador Developer Office Hours: Summer of Kubernetes Ship Week 1: Intro to...
 
Cloud native development without the toil
Cloud native development without the toilCloud native development without the toil
Cloud native development without the toil
 
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
Webinar: Accelerate Your Inner Dev Loop for Kubernetes Services
 
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
[Confoo Montreal 2020] Build Your Own Serverless with Knative - Alex Gervais
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
 
What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0? What's New in the Ambassador Edge Stack 1.0?
What's New in the Ambassador Edge Stack 1.0?
 
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
Webinar: Effective Management of APIs and the Edge when Adopting Kubernetes
 
Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy Ambassador: Building a Control Plane for Envoy
Ambassador: Building a Control Plane for Envoy
 
Telepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for KubernetesTelepresence - Fast Development Workflows for Kubernetes
Telepresence - Fast Development Workflows for Kubernetes
 
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
[KubeCon NA 2018] Telepresence Deep Dive Session - Rafael Schloming & Luke Sh...
 
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
[KubeCon NA 2018] Effective Kubernetes Develop: Turbocharge Your Dev Loop - P...
 
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
The rise of Layer 7, microservices, and the proxy war with Envoy, NGINX, and ...
 
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYCThe Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
The Simply Complex Task of Implementing Kubernetes Ingress - Velocity NYC
 
Ambassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API GatewayAmbassador Kubernetes-Native API Gateway
Ambassador Kubernetes-Native API Gateway
 
Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh? Micro xchg 2018 - What is a Service Mesh?
Micro xchg 2018 - What is a Service Mesh?
 
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
KubeCon NA 2017: Ambassador and Envoy (Envoy Salon)
 
Webinar: Code Faster on Kubernetes
Webinar: Code Faster on KubernetesWebinar: Code Faster on Kubernetes
Webinar: Code Faster on Kubernetes
 
QCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented DevelopmentQCon SF 2017 - Microservices: Service-Oriented Development
QCon SF 2017 - Microservices: Service-Oriented Development
 
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
Montreal Kubernetes Meetup: Developer-first workflows (for microservices) on ...
 

Recently uploaded

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 

Recently uploaded (20)

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 

[Confoo Montreal 2020] From Grief to Growth: The 7 Stages of Observability - Alex Gervais

  • 1. From Grief to Growth: The 7 Stages of Observability Alex Gervais ConFoo Montreal Feb. 26, 2020
  • 2. Bonjour-Hi! Outdoorsy, data-driven, eternal student, not so geeky creative mind and traveler. Alex is a curious, introverted and humble character. Working by day as a Senior Software Developer at Datawire he has many years of savoir-faire building full-stack systems from cloud infrastructures to backend services and DevOps tools. Alex is a Kubernetes early adopter who thrives on collaboration and contributor to many Cloud-Native projects. @alex_gervais on twitter alexgervais on github
  • 4. Prediction time... Serverless will take over the world.
  • 5. The story only the names have been changed to protect the innocent
  • 6. Monoliths and Distributed Applications my-app web-frontend the-relational-db this-other-service oh-that-third-party big-data-warehouse infosec-robot some-cache this-other-service -v2 email-relay
  • 7.
  • 8. In the Monolith era, we were equipped with ● Log files and tail, grep, and more ● curl, tcpdump, dig, and more ● IDE debuggers ● Some form of APM ● Classic monitoring such as Nagios, Pingdom, and more
  • 9. Business-critical operations Who’s monitoring the user experience and satisfaction? Are we equipped to face traffic bursts? How is our release process and QA culture impacting release frequency AND quality? Who owns each project and component?
  • 10. Grief deep sorrow, especially that caused by someone's death
  • 11. This is soooo complicated... ● We have downtimes, failures, unstable environments and the infrastructure footprint is increasing ● Everything comes up as a SURPRISE! ● Failures are everywhere: in infrastructure, in applications, in configuration ● Failures are visible: in degraded performance, in bugs, in downtime ● Fatigue
  • 12. To delight our customers with our SaaS offering, our problem was… How could we reach operational excellence and share ownership effectively?
  • 13. Objectives ● Increase our global comprehension of the system ● Improve the platform stability ● Find owners for each component and involve them in operational work ● State of the art observability
  • 15. Blind operations ● We have all these metrics and data sources, but we are still operating in the dark ● Too many tools, no trustable source of truth ● No runbooks ● Clients report errors, slowness and downtimes before we catch them… That’s not good! ● An easy prey for vendors
  • 16. Concepts Service-Level Indicator - SLI Service-Level Objective - SLO Service-Level Agreement - SLA Error budgets (P99) Distinguish the types of error: bugs vs performance; app vs infra vs config
  • 17.
  • 18. Envy a feeling of discontented or resentful longing aroused by someone else's possessions, qualities, or luck
  • 19. Keep on dreaming ● Case studies, blogs, and vendor pitches ● Conferences and learning from mature organizations ● Is this a miracle? ● Chaos engineering ● No free cycles to improve
  • 20. Toil Manual, repetitive, automatable, interrupt-driven and reactive work with no enduring value. Unplanned Work Work In Progress
  • 21. Adopting a vision ● Business sponsors ● Invest in our platform ● Identified key enablers to reduce toil
  • 22. Excitement a feeling of great enthusiasm and eagerness
  • 23. We set out to improve our tooling ● Introduce new bleeding-edge technologies ● Custom tools ● Many integrations and trials
  • 24. Satisfaction fulfillment of one's wishes, expectations, or needs, or the pleasure derived from this
  • 25. Good is the enemy of great ● We did a lot of custom code and integrations ● We need to define internal standards ● We tried pretty much everything, all tools and solutions had their chance ● Definition of Done?
  • 26. Increase in the number of tools... ● Too many tools and solutions ● We are not reducing complexity ● We now need operational knowledge of more tools!
  • 27. Friction ● (Intrusive) code instrumentation multiple stacks ● Bumping heads with competing standards ● Change of mentality ● Change of habits
  • 28. Fear an unpleasant emotion caused by the belief that someone or something is dangerous, likely to cause pain, or a threat
  • 29. Living in fear ● No adoption ● Are we covered? ● How’s ownership going?
  • 30. Unknown unknowns ● Practice… practice… incident management & chaos engineering ● Distributed tracing data had a lot of potential, but we failed at extracting any value out of the data we collected
  • 32. Double down on distributed tracing ● Vendor partnership ● Instrument our code ● Attach business metrics and logs to spans ● Collect the data ● Infer metrics from the collected data ● Render in a comprehensive single pane of glass ● No data retention; No sampling
  • 36. Distributed traces Helped define our SLIs Visibility on our SLOs Identify root-causes Understand failure modes Go-to dashboard
  • 38. Growth 1. the process of developing or maturing physically, mentally, or spiritually 2. the process of increasing in amount, value, or importance
  • 39. Addressing the problem ● Find, assign and route alert to owners ● Reduce the number of moving parts (environments) ● Start at the edge ● “3 pillars” ● Aligned with business objectives ● The power of habits
  • 41. Credits [1] The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win, Gene Kim, Kevin Behr, George Spafford, 2013 [2] https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c [3] https://www.datadoghq.com/blog/monitoring-101-alerting/ [4] https://xkcd.com/2259/ [5] https://landing.google.com/sre/sre-book/chapters/eliminating-toil/ [6] https://imgix.datadoghq.com/img/blog/net-monitoring-apm/dotnet-monitoring-service-map v3.png?fit=max [7] https://vital-leaf.cloudvent.net/paths/gs-resolve-path/step-three [8] https://github.com/honeycombio/examples/tree/master/kubernetes-envoy-tracing