SlideShare a Scribd company logo
1 of 36
Monitoring micro-services platform 
Boyan Dimitrov, 
Platform Engineering @ Hailo @nathariel
Outline 
• Intro to the Hailo world 
• Platform Overview 
• Monitoring Evolution
The Platform 
Troll a platform by Swinsto101 / CC BY-SA 3.0 / Desaturated 
from original
Platform specifics 
• SOA based on Go ( and Java… ) 
• 1000+ AWS instances spanning multiple regions 
• 160+ services in production 
• Designed specifically for the cloud – different building blocks and 
components will constantly be in flux, broken or unavailable.
eu-west-1 
Proxy Layer 
Message Bus+ 
Go Services 
Java 
Services 
C* 
us-east-1 
Proxy Layer 
Message Bus+ 
Go Services Java 
C* 
Services
Provisioning Service 
CI Pipeline (Janky/Jenkins) 
Amazon S3 
Provisioning Service Provisioning Service 
Provisioning Manager 
Docker Registry 
Inside an environment
A micro-service under the hood 
Handler platform-layer 
Logic 
Storage 
Library for abstracting service-to- 
service comms 
service-layer 
Self-configuring external 
service adapters 
Service 
Any service gets for free: 
• Provisioning 
• Discovery 
• Configuration 
• Authentication/Authorization 
• A/B testing capabilities 
• Self-configuring connectivity to 
third-party services 
• Monitoring 
• Instrumentation
Mission: 
Define high level platform and business metrics 
Gather as many insights as possible 
Add automatic failover and recovery capabilities 
"A[ollo 8 Launch Control Room” by Tfawls 
/ Desaturated from original
PHP Java 
Host Instance 
Graphite 
Zabbix 
Aspiration vs Reality 
CloudWatch 
Zabbix 
Agent 
StatsD Carbon
Challenges 
• Single StatsD instance and generic graphite setup cannot cope with all the traffic 
(surprise!) 
• No easy way of generating and searching for graphs quickly 
• We didn’t instrument everything 
• “Traditional” monitoring systems can only give basic app insights 
• Se#ing up app templates is a manual daunting process and does not scale 
• No in-depth visibility into our main KPIs 
• No way of identifying platform / release / config / cloud infrastructure changes
Instrumentation++ 
“Airplaine board” by Smithore 
/ Desaturated from original
Host Instance 
Graphite 
Cache 
Zabbix 
Iterate on what we already know 
Relay 
CloudWatch 
CollectD StatsD 
Cache 
Cache 
Zabbix 
Agent
Result 
• Scaling up graphite and moving StatsD to every box allowed us to collect millions 
of metrics 
• Instrumenting everything gives us a lot of insights. 
• Grafana allows us to quickly build, store and search for important graphs. Widely 
adopted by the whole development team! 
Tip: Focus on upper 95th and 99th percentiles and work out from there.
Monitoring & 
Instrumentation
RReatzhiinekl Service 
Monitoring
Provisioning Service 
Message bus 
Monitoring 
Service 
New 
Service 
Publish 
Healthchecks 
Host Instance 
Provisioning Manager 
Binding Discovery 
Provisioning Service 
Host Instance 
Monitoring 
V2
healthcheck.Register(&healthcheck.HealthCheck{! 
Id: “MyHCId”,! 
ServiceName: ServiceName,! 
ServiceVersion: ServiceVersion,! 
Hostname: Hostname,! 
InstanceId: InstanceID,! 
Interval: time.Minute,! 
Checker: myCallbackFunc,! 
Priority: hc.Warning,! 
})!
Service level health checks
Result 
• Service health checks give us in-depth service performance details 
• The monitoring service has a holistic view of our platform health and can identify 
degraded availability zones 
• Developers can identify what is important for their service and track & alert on it.
Trace++ 
Monitoring & 
Instrumentation 
“Abstract conception of network and communication” 
by Leszekglasner / Desaturated from original
Trace Architecture 
CollectD StatsD 
Zabbix 
Agent 
Provisioning Service 
Host Instance 
Phosphor 
Publish 
Trace 
Service 
Dashboards 
Monitoring 
In-memory 
Aggregates 
Optional 
persistant 
storage 
Async 
UDP
Live traffic flows
Live traffic flows
Automatic request tracing
Result 
• Trace incoming requests and pinpoint bo#lenecks & SLA offenders 
• Easily identify problems on the request/response path 
• Quickly find out exactly which services participate on the request path
Robomon
Automated Jobs
Result 
• Identify business impacting issues immediately 
• Highlight the service on the critical path that is most likely responsible for the 
problems
Event Correlation 
“Connection” by A2bb5s 
/ Desaturated from the original
CollectD StatsD 
Zabbix 
Agent 
Provisioning Service 
Host Instance 
Phosphor 
Publish 
c 
Dashboards 
Monitoring 
Persistent 
Storage 
SNS 
Platform 
Events 
Whisper 
Service 
c 
Platform events
Result 
• Answer to the most important “Did anything change?” question 
• Audit trail for any platform changes 
• Holistic view of our platform status
It is not over yet! 
++ Machine Learning 
++ Event source weighting
Thanks! 
PS. We’re hiring! 
@nathariel 
boyan@hailocab.com London DevOps

More Related Content

What's hot

SDN Fundamentals - short presentation
SDN Fundamentals -  short presentationSDN Fundamentals -  short presentation
SDN Fundamentals - short presentationAzhar Khuwaja
 
CNIT 123: Ch 13: Network Protection Systems
CNIT 123: Ch 13: Network Protection SystemsCNIT 123: Ch 13: Network Protection Systems
CNIT 123: Ch 13: Network Protection SystemsSam Bowne
 
Serverless computing and Function-as-a-Service (FaaS)
Serverless computing and Function-as-a-Service (FaaS)Serverless computing and Function-as-a-Service (FaaS)
Serverless computing and Function-as-a-Service (FaaS)Moritz Strube
 
Container Patching: Cloud Native Security Con 2023
Container Patching: Cloud Native Security Con 2023Container Patching: Cloud Native Security Con 2023
Container Patching: Cloud Native Security Con 2023Greg Castle
 
Introduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineIntroduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineDenis Voituron
 
Network Design and Security Best Practices
Network Design and Security Best PracticesNetwork Design and Security Best Practices
Network Design and Security Best PracticesMike Sherwood
 
Microsoft Azure Networking Basics
Microsoft Azure Networking BasicsMicrosoft Azure Networking Basics
Microsoft Azure Networking BasicsSai Kishore Naidu
 
WEP/WPA attacks
WEP/WPA attacksWEP/WPA attacks
WEP/WPA attacksHuda Seyam
 
Azure Networking: Innovative Features and Multi-VNet Topologies
Azure Networking: Innovative Features and Multi-VNet TopologiesAzure Networking: Innovative Features and Multi-VNet Topologies
Azure Networking: Innovative Features and Multi-VNet TopologiesMarius Zaharia
 
Firewall PPT
Firewall PPTFirewall PPT
Firewall PPTMytec1
 
Vulnerability and Assessment Penetration Testing
Vulnerability and Assessment Penetration TestingVulnerability and Assessment Penetration Testing
Vulnerability and Assessment Penetration TestingYvonne Marambanyika
 
Continuous Integration
Continuous IntegrationContinuous Integration
Continuous Integrationdrluckyspin
 
The Firewall Policy Hangover: Alleviating Security Management Migraines
The Firewall Policy Hangover: Alleviating Security Management MigrainesThe Firewall Policy Hangover: Alleviating Security Management Migraines
The Firewall Policy Hangover: Alleviating Security Management MigrainesAlgoSec
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Site24x7
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos EngineeringSIGHUP
 
2 what is the best firewall (sizing)
2 what is the best firewall (sizing)2 what is the best firewall (sizing)
2 what is the best firewall (sizing)Mostafa El Lathy
 

What's hot (20)

Cloud Monitoring
Cloud MonitoringCloud Monitoring
Cloud Monitoring
 
SDN Fundamentals - short presentation
SDN Fundamentals -  short presentationSDN Fundamentals -  short presentation
SDN Fundamentals - short presentation
 
CNIT 123: Ch 13: Network Protection Systems
CNIT 123: Ch 13: Network Protection SystemsCNIT 123: Ch 13: Network Protection Systems
CNIT 123: Ch 13: Network Protection Systems
 
Serverless computing and Function-as-a-Service (FaaS)
Serverless computing and Function-as-a-Service (FaaS)Serverless computing and Function-as-a-Service (FaaS)
Serverless computing and Function-as-a-Service (FaaS)
 
Azure DevOps
Azure DevOpsAzure DevOps
Azure DevOps
 
Container Patching: Cloud Native Security Con 2023
Container Patching: Cloud Native Security Con 2023Container Patching: Cloud Native Security Con 2023
Container Patching: Cloud Native Security Con 2023
 
Introduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) OnlineIntroduction to Team Foundation Server (TFS) Online
Introduction to Team Foundation Server (TFS) Online
 
Network Design and Security Best Practices
Network Design and Security Best PracticesNetwork Design and Security Best Practices
Network Design and Security Best Practices
 
Microsoft Azure Networking Basics
Microsoft Azure Networking BasicsMicrosoft Azure Networking Basics
Microsoft Azure Networking Basics
 
WEP/WPA attacks
WEP/WPA attacksWEP/WPA attacks
WEP/WPA attacks
 
Azure Networking: Innovative Features and Multi-VNet Topologies
Azure Networking: Innovative Features and Multi-VNet TopologiesAzure Networking: Innovative Features and Multi-VNet Topologies
Azure Networking: Innovative Features and Multi-VNet Topologies
 
Firewall PPT
Firewall PPTFirewall PPT
Firewall PPT
 
Vulnerability and Assessment Penetration Testing
Vulnerability and Assessment Penetration TestingVulnerability and Assessment Penetration Testing
Vulnerability and Assessment Penetration Testing
 
Paravirtualization
ParavirtualizationParavirtualization
Paravirtualization
 
Continuous Integration
Continuous IntegrationContinuous Integration
Continuous Integration
 
The Firewall Policy Hangover: Alleviating Security Management Migraines
The Firewall Policy Hangover: Alleviating Security Management MigrainesThe Firewall Policy Hangover: Alleviating Security Management Migraines
The Firewall Policy Hangover: Alleviating Security Management Migraines
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos Engineering
 
2 what is the best firewall (sizing)
2 what is the best firewall (sizing)2 what is the best firewall (sizing)
2 what is the best firewall (sizing)
 
Types Of Firewall Security
Types Of Firewall SecurityTypes Of Firewall Security
Types Of Firewall Security
 

Similar to Monitoring microservices platform

Moving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyMoving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyBoyan Dimitrov
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsDigitalOcean
 
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...Adin Ermie
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Amazon Web Services
 
Cloud Foundry Technical Overview
Cloud Foundry Technical OverviewCloud Foundry Technical Overview
Cloud Foundry Technical Overviewcornelia davis
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Introducing ONAP for OpenStack St Louis Meetup
Introducing ONAP for OpenStack St Louis MeetupIntroducing ONAP for OpenStack St Louis Meetup
Introducing ONAP for OpenStack St Louis Meetupdjzook
 
WLCG Grid Infrastructure Monitoring
WLCG Grid Infrastructure MonitoringWLCG Grid Infrastructure Monitoring
WLCG Grid Infrastructure MonitoringJames Casey
 
Modernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectModernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectDevOps.com
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...Adrian Cockcroft
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
LogicMonitor: An Overview
LogicMonitor: An Overview LogicMonitor: An Overview
LogicMonitor: An Overview James McCabe
 
Streaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamStreaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamInformaticaMarketplace
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
Azure Cloud Application Development Workshop - UGIdotNET
Azure Cloud Application Development Workshop - UGIdotNETAzure Cloud Application Development Workshop - UGIdotNET
Azure Cloud Application Development Workshop - UGIdotNETLorenzo Barbieri
 
Cloudify workshop at CCCEU 2014
Cloudify workshop at CCCEU 2014 Cloudify workshop at CCCEU 2014
Cloudify workshop at CCCEU 2014 Uri Cohen
 
AMB110: IT Asset Management – How to Start When You Don’t Know Where to Start
AMB110: IT Asset Management – How to Start When You Don’t Know Where to StartAMB110: IT Asset Management – How to Start When You Don’t Know Where to Start
AMB110: IT Asset Management – How to Start When You Don’t Know Where to StartIvanti
 
Microservices: next-steps
Microservices: next-stepsMicroservices: next-steps
Microservices: next-stepsBoyan Dimitrov
 
z Technical Summit Track 3 Session 4 Developing mobilefirst app for z
z Technical Summit Track 3 Session 4 Developing mobilefirst app for zz Technical Summit Track 3 Session 4 Developing mobilefirst app for z
z Technical Summit Track 3 Session 4 Developing mobilefirst app for znick_garrod
 
Infrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsInfrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsLior Kamrat
 

Similar to Monitoring microservices platform (20)

Moving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journeyMoving to microservices – a technology and organisation transformational journey
Moving to microservices – a technology and organisation transformational journey
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult Steps
 
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
 
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
Using AWS to Build a Scalable Big Data Management & Processing Service (BDT40...
 
Cloud Foundry Technical Overview
Cloud Foundry Technical OverviewCloud Foundry Technical Overview
Cloud Foundry Technical Overview
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Introducing ONAP for OpenStack St Louis Meetup
Introducing ONAP for OpenStack St Louis MeetupIntroducing ONAP for OpenStack St Louis Meetup
Introducing ONAP for OpenStack St Louis Meetup
 
WLCG Grid Infrastructure Monitoring
WLCG Grid Infrastructure MonitoringWLCG Grid Infrastructure Monitoring
WLCG Grid Infrastructure Monitoring
 
Modernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectModernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-Architect
 
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimi...
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
LogicMonitor: An Overview
LogicMonitor: An Overview LogicMonitor: An Overview
LogicMonitor: An Overview
 
Streaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamStreaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data Stream
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Azure Cloud Application Development Workshop - UGIdotNET
Azure Cloud Application Development Workshop - UGIdotNETAzure Cloud Application Development Workshop - UGIdotNET
Azure Cloud Application Development Workshop - UGIdotNET
 
Cloudify workshop at CCCEU 2014
Cloudify workshop at CCCEU 2014 Cloudify workshop at CCCEU 2014
Cloudify workshop at CCCEU 2014
 
AMB110: IT Asset Management – How to Start When You Don’t Know Where to Start
AMB110: IT Asset Management – How to Start When You Don’t Know Where to StartAMB110: IT Asset Management – How to Start When You Don’t Know Where to Start
AMB110: IT Asset Management – How to Start When You Don’t Know Where to Start
 
Microservices: next-steps
Microservices: next-stepsMicroservices: next-steps
Microservices: next-steps
 
z Technical Summit Track 3 Session 4 Developing mobilefirst app for z
z Technical Summit Track 3 Session 4 Developing mobilefirst app for zz Technical Summit Track 3 Session 4 Developing mobilefirst app for z
z Technical Summit Track 3 Session 4 Developing mobilefirst app for z
 
Infrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & ToolsInfrastructure as Code - Getting Started, Concepts & Tools
Infrastructure as Code - Getting Started, Concepts & Tools
 

More from Boyan Dimitrov

Complex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSComplex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSBoyan Dimitrov
 
Complex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSComplex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSBoyan Dimitrov
 
Building Highly Sophisticated Environments for Security and Compliance on AWS
Building Highly Sophisticated Environments for Security and Compliance on AWSBuilding Highly Sophisticated Environments for Security and Compliance on AWS
Building Highly Sophisticated Environments for Security and Compliance on AWSBoyan Dimitrov
 
Observability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesObservability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesBoyan Dimitrov
 
Anatomy of the modern application stack
Anatomy of the modern application stackAnatomy of the modern application stack
Anatomy of the modern application stackBoyan Dimitrov
 
Patterns for building resilient and scalable microservices platform on AWS
Patterns for building resilient and scalable microservices platform on AWSPatterns for building resilient and scalable microservices platform on AWS
Patterns for building resilient and scalable microservices platform on AWSBoyan Dimitrov
 
Microservices and elastic resource pools with Amazon EC2 Container Service
Microservices and elastic resource pools with Amazon EC2 Container ServiceMicroservices and elastic resource pools with Amazon EC2 Container Service
Microservices and elastic resource pools with Amazon EC2 Container ServiceBoyan Dimitrov
 
Scaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWSScaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWSBoyan Dimitrov
 
Scaling from 1 to 10 million users - Hailo
Scaling from 1 to 10 million users - HailoScaling from 1 to 10 million users - Hailo
Scaling from 1 to 10 million users - HailoBoyan Dimitrov
 

More from Boyan Dimitrov (9)

Complex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSComplex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWS
 
Complex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWSComplex architectures for authentication and authorization on AWS
Complex architectures for authentication and authorization on AWS
 
Building Highly Sophisticated Environments for Security and Compliance on AWS
Building Highly Sophisticated Environments for Security and Compliance on AWSBuilding Highly Sophisticated Environments for Security and Compliance on AWS
Building Highly Sophisticated Environments for Security and Compliance on AWS
 
Observability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architecturesObservability foundations in dynamically evolving architectures
Observability foundations in dynamically evolving architectures
 
Anatomy of the modern application stack
Anatomy of the modern application stackAnatomy of the modern application stack
Anatomy of the modern application stack
 
Patterns for building resilient and scalable microservices platform on AWS
Patterns for building resilient and scalable microservices platform on AWSPatterns for building resilient and scalable microservices platform on AWS
Patterns for building resilient and scalable microservices platform on AWS
 
Microservices and elastic resource pools with Amazon EC2 Container Service
Microservices and elastic resource pools with Amazon EC2 Container ServiceMicroservices and elastic resource pools with Amazon EC2 Container Service
Microservices and elastic resource pools with Amazon EC2 Container Service
 
Scaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWSScaling micro-services Architecture on AWS
Scaling micro-services Architecture on AWS
 
Scaling from 1 to 10 million users - Hailo
Scaling from 1 to 10 million users - HailoScaling from 1 to 10 million users - Hailo
Scaling from 1 to 10 million users - Hailo
 

Recently uploaded

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Monitoring microservices platform

  • 1. Monitoring micro-services platform Boyan Dimitrov, Platform Engineering @ Hailo @nathariel
  • 2. Outline • Intro to the Hailo world • Platform Overview • Monitoring Evolution
  • 3.
  • 4. The Platform Troll a platform by Swinsto101 / CC BY-SA 3.0 / Desaturated from original
  • 5. Platform specifics • SOA based on Go ( and Java… ) • 1000+ AWS instances spanning multiple regions • 160+ services in production • Designed specifically for the cloud – different building blocks and components will constantly be in flux, broken or unavailable.
  • 6. eu-west-1 Proxy Layer Message Bus+ Go Services Java Services C* us-east-1 Proxy Layer Message Bus+ Go Services Java C* Services
  • 7. Provisioning Service CI Pipeline (Janky/Jenkins) Amazon S3 Provisioning Service Provisioning Service Provisioning Manager Docker Registry Inside an environment
  • 8. A micro-service under the hood Handler platform-layer Logic Storage Library for abstracting service-to- service comms service-layer Self-configuring external service adapters Service Any service gets for free: • Provisioning • Discovery • Configuration • Authentication/Authorization • A/B testing capabilities • Self-configuring connectivity to third-party services • Monitoring • Instrumentation
  • 9. Mission: Define high level platform and business metrics Gather as many insights as possible Add automatic failover and recovery capabilities "A[ollo 8 Launch Control Room” by Tfawls / Desaturated from original
  • 10. PHP Java Host Instance Graphite Zabbix Aspiration vs Reality CloudWatch Zabbix Agent StatsD Carbon
  • 11. Challenges • Single StatsD instance and generic graphite setup cannot cope with all the traffic (surprise!) • No easy way of generating and searching for graphs quickly • We didn’t instrument everything • “Traditional” monitoring systems can only give basic app insights • Se#ing up app templates is a manual daunting process and does not scale • No in-depth visibility into our main KPIs • No way of identifying platform / release / config / cloud infrastructure changes
  • 12. Instrumentation++ “Airplaine board” by Smithore / Desaturated from original
  • 13. Host Instance Graphite Cache Zabbix Iterate on what we already know Relay CloudWatch CollectD StatsD Cache Cache Zabbix Agent
  • 14. Result • Scaling up graphite and moving StatsD to every box allowed us to collect millions of metrics • Instrumenting everything gives us a lot of insights. • Grafana allows us to quickly build, store and search for important graphs. Widely adopted by the whole development team! Tip: Focus on upper 95th and 99th percentiles and work out from there.
  • 17. Provisioning Service Message bus Monitoring Service New Service Publish Healthchecks Host Instance Provisioning Manager Binding Discovery Provisioning Service Host Instance Monitoring V2
  • 18. healthcheck.Register(&healthcheck.HealthCheck{! Id: “MyHCId”,! ServiceName: ServiceName,! ServiceVersion: ServiceVersion,! Hostname: Hostname,! InstanceId: InstanceID,! Interval: time.Minute,! Checker: myCallbackFunc,! Priority: hc.Warning,! })!
  • 20. Result • Service health checks give us in-depth service performance details • The monitoring service has a holistic view of our platform health and can identify degraded availability zones • Developers can identify what is important for their service and track & alert on it.
  • 21. Trace++ Monitoring & Instrumentation “Abstract conception of network and communication” by Leszekglasner / Desaturated from original
  • 22. Trace Architecture CollectD StatsD Zabbix Agent Provisioning Service Host Instance Phosphor Publish Trace Service Dashboards Monitoring In-memory Aggregates Optional persistant storage Async UDP
  • 26. Result • Trace incoming requests and pinpoint bo#lenecks & SLA offenders • Easily identify problems on the request/response path • Quickly find out exactly which services participate on the request path
  • 29. Result • Identify business impacting issues immediately • Highlight the service on the critical path that is most likely responsible for the problems
  • 30. Event Correlation “Connection” by A2bb5s / Desaturated from the original
  • 31. CollectD StatsD Zabbix Agent Provisioning Service Host Instance Phosphor Publish c Dashboards Monitoring Persistent Storage SNS Platform Events Whisper Service c Platform events
  • 32.
  • 33.
  • 34. Result • Answer to the most important “Did anything change?” question • Audit trail for any platform changes • Holistic view of our platform status
  • 35. It is not over yet! ++ Machine Learning ++ Event source weighting
  • 36. Thanks! PS. We’re hiring! @nathariel boyan@hailocab.com London DevOps