Monitoring & Incident Management for Yamato
2023
Yamato monitoring system
Agenda
● Overview
● Monitoring
● Incident Management
● Monitoring technical details
● Incident Management technical details
● Concerns and next steps
● QA
Overview
The monitoring system has two components: a monitoring dashboard for analysing performance, and Incident
Management for reacting to incidents.
Monitoring
The Yamato platform dashboard shows the current status and the historical data from a performance
point of view:
● General overview
● DWH: Infrastructure
● DWH: Workload
Monitoring
The Yamato platform jobs dashboard shows historical data for jobs once they have completed:
● Completed jobs
● Instrumented jobs
● Maintenance jobs
Incident Management
PagerDuty (PD) collects the alarms sent by New Relic (NR).
● Yamato INFRA collects alarms from the Yamato Redshift Cluster, the WLM Shift Enforcer, …
● Yamato DEV collects alarms from everything developed on top of the infrastructure (maintenance, RDL (livesync
and hydra), other pipelines, …)
Every alarm has a defined prioritisation (high, low or info notification).
To review the IM SLAs, use the PD Insight.
Incident Management
Info notifications are not incidents; they can be reviewed in the Alerts panel.
Incident Management
Incidents are routed to the EU Data team, which has two
schedules to manage incident escalation.
● Level 1 (data team developers) receives low and high incidents
● Level 2 (data team managers) receives high incidents if they are
not acknowledged by Level 1
Note: The data team manages the infrastructure and development
incidents that they own. SRE focuses on the infrastructure that they own. The split is
about the nature of the incidents, so don't confuse the two :)
Note: The purpose of Level 2 is not to resolve an incident.
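The two-level escalation can be sketched as a small routing function. This is only an illustration of the rule described on this slide, not the PD configuration itself: the Incident class, the level names and the level1_timed_out flag (standing in for PD's acknowledgement timeout) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical model: urgency is "high" or "low" (info never becomes an incident).
    urgency: str
    acknowledged: bool = False


def recipients(incident: Incident, level1_timed_out: bool) -> list[str]:
    """Return the escalation levels that are notified for an incident."""
    notified = ["level1"]  # Level 1 receives both low and high incidents
    # Level 2 is only notified for high incidents that Level 1 has not
    # acknowledged within the (assumed) acknowledgement timeout.
    if incident.urgency == "high" and not incident.acknowledged and level1_timed_out:
        notified.append("level2")
    return notified
```

A low incident never reaches Level 2 in this sketch, even if nobody acknowledges it, which matches the rule that managers are only notified about unacknowledged high incidents.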
Monitoring technical details
Monitoring Tech details
The dashboard uses CloudWatch (CW) metrics converted to NR metrics by the metric streamer.
We also add custom metrics from CW.
The custom metrics use the label field of the Redshift system tables. This approach is used by the odyn DAGs, and it
allows us to identify the job name and the job type.
It is applied in all the instrumented jobs (livesync, hydra, amnesia) and the odyn DAGs.
The rest show as unknown.
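As an illustration of how a label can be turned into a job type and a job name, here is a minimal sketch. The "<jobType>:<jobName>" label format is an assumption made for the example; the slides do not show the real format used by the jobs.

```python
UNKNOWN = "unknown"


def parse_job_label(label):
    """Split a query label into (job_type, job_name).

    Assumes a hypothetical "<jobType>:<jobName>" format. Anything that
    does not match is reported as unknown, like unlabelled queries.
    """
    if not label:
        return (UNKNOWN, UNKNOWN)
    # System table text columns may be padded with spaces, so strip first.
    job_type, sep, job_name = label.strip().partition(":")
    if not sep or not job_type or not job_name:
        return (UNKNOWN, UNKNOWN)
    return (job_type, job_name)
```

Queries that carry no recognisable label fall back to ("unknown", "unknown"), which is how the remaining jobs end up grouped under unknown in the dashboard.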
Monitoring Tech details
The custom metrics importer pushes two types of metrics:
● Completed metrics: these cover queries that have finished, aggregated at job level. The
dimensions are jobType, jobName and Username. The importer currently runs every 30 minutes using the super
admin queue.
"JobQueries", "JobCPUUtilization", "JobExecutionTime", "JobQueueTime", "JobBlocksRead", "JobTempBlocksToDisk", "JobSpectrumUsage", "JobNestedLoopJoinRowCount", "JobReturnRowCount", "JobJoinRowCount", "JobMaxSegmentExecutionTime", "JobMaxSegmentIoSkew", "JobConcurrencyScalingTime", "JobResultCachingRatio", "JobWorkmem", "JobSessions"
● Inflight metrics: these cover in-memory queries, aggregated at job level. The dimensions are
jobType, jobName, QueueName and Username. The importer runs every minute using the super admin queue.
"InflightJobQueries", "InflightJobCPUUtilization", "InflightJobExecutionTime", "InflightJobQueueTime", "InflightJobBlocksRead", "InflightJobTempBlocksToDisk", "InflightJobSpectrumUsage", "InflightJobNestedLoopJoinRowCount", "InflightJobReturnRowCount", "InflightJobJoinRowCount", "InflightJobMaxSegmentExecutionTime", "InflightJobMaxSegmentIoSkew", "InflightJobConcurrencyScalingTime", "InflightJobWorkmem", "InflightJobSessions"
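A minimal sketch of the job-level aggregation for two of the completed metrics. The input row keys (job_type, job_name, username, execution_seconds) and the choice of summing are assumptions; the output entries follow the MetricName / Dimensions / Value shape used by the CloudWatch PutMetricData API, but nothing is sent anywhere here.

```python
from collections import defaultdict


def aggregate_by_job(rows):
    """Sum per-query values into one total per (jobType, jobName, Username)."""
    totals = defaultdict(lambda: {"JobQueries": 0, "JobExecutionTime": 0.0})
    for row in rows:
        key = (row["job_type"], row["job_name"], row["username"])
        totals[key]["JobQueries"] += 1
        totals[key]["JobExecutionTime"] += row["execution_seconds"]
    return dict(totals)


def to_metric_data(totals):
    """Build one metric entry per job and metric name."""
    data = []
    for (job_type, job_name, username), metrics in sorted(totals.items()):
        dimensions = [
            {"Name": "jobType", "Value": job_type},
            {"Name": "jobName", "Value": job_name},
            {"Name": "Username", "Value": username},
        ]
        for name, value in metrics.items():
            data.append({"MetricName": name, "Dimensions": dimensions, "Value": value})
    return data
```

The inflight variant would add QueueName as a fourth dimension and read from the queries still running instead of the finished ones.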
Monitoring Tech details
The NR alarms are defined in these three policies:
● Infra Alert policy: Redshift, lambda enforcer
● Dev Job alert policy: livesync, hydra, amnesia, maintenance, custom metrics importer
● Custom Metric alert policy: Inflight Job metrics
Monitoring Tech details
Every policy gathers the alarm information and passes it to PD using an NR notification template.
● DEV Job notification
● INFRA notification
● DEV Custom Metric Job notification
IM technical details
IM Tech details
The New Relic incident lifecycle auto-resolves an incident when the alarm condition recovers. This is a good
approach for infrastructure incidents, but it is not acceptable for developer incidents, which require manual
intervention to close.
The orchestration rules in PD can disable the auto-resolve behaviour:
● The infrastructure service allows NR to auto-resolve an incident
● The developer service does not allow NR to auto-resolve an incident
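The effect of these rules can be sketched as a filter on incoming events. The service names are hypothetical, and the event_action values ("trigger", "acknowledge", "resolve") are borrowed from the PagerDuty Events API v2 for illustration; the real orchestration rules are configured in PD, not in code like this.

```python
# Hypothetical service names; the slides only distinguish an
# infrastructure service and a developer service.
AUTO_RESOLVE_ALLOWED = {"yamato-infra": True, "yamato-dev": False}


def should_apply(service, event_action):
    """Decide whether an incoming NR event is applied to a PD service."""
    if event_action == "resolve":
        # Services not listed default to requiring a manual close.
        return AUTO_RESOLVE_ALLOWED.get(service, False)
    return True  # triggers and acknowledgements always pass through
```

With this rule a recovered condition closes an infrastructure incident automatically, while a developer incident stays open until someone resolves it by hand.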
IM Tech details
To manage prioritisation, every alarm defines a pdUrgency and/or pdPriority attribute which, through the
orchestration rules, sets the severity and priority in PD.
The services in PD are configured based on the severity of the alert.
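A sketch of how the two attributes could be translated by the orchestration rules. The urgency-to-severity mapping, the "low" default and the example priority value are assumptions for illustration; only the attribute names pdUrgency and pdPriority and the three urgency levels come from the slides.

```python
# Assumed mapping from the alarm's pdUrgency to a PD severity.
SEVERITY_BY_URGENCY = {"high": "critical", "low": "warning", "info": "info"}


def route(attributes):
    """Translate NR alarm attributes into PD severity and priority."""
    urgency = attributes.get("pdUrgency", "low")  # assumed default
    return {
        "severity": SEVERITY_BY_URGENCY.get(urgency, "warning"),
        "priority": attributes.get("pdPriority"),  # None when the alarm sets none
        # Info notifications are not incidents.
        "creates_incident": urgency != "info",
    }
```

For example, route({"pdUrgency": "high", "pdPriority": "P1"}) yields a critical severity with priority "P1", where "P1" is just a placeholder value.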
IM Tech details
● The INFRA alarms define the pdUrgency and pdPriority attributes in NR
IM Tech details
● The Job DEV alarms for APM (instrumented jobs) define the prioritisation attributes using an
environment variable that you can set at job pipeline level
● The Job DEV alarms for the maintenance jobs define the prioritisation in the NR alarm.
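For the instrumented jobs, reading the prioritisation from the environment could look like the sketch below. PD_URGENCY and PD_PRIORITY are hypothetical variable names, and the "low" default is an assumption; the slides do not name the actual variable.

```python
import os

VALID_URGENCIES = {"high", "low", "info"}


def priority_attributes(environ=os.environ):
    """Read prioritisation attributes from the job's environment.

    PD_URGENCY and PD_PRIORITY are hypothetical variable names.
    """
    urgency = environ.get("PD_URGENCY", "low").lower()
    if urgency not in VALID_URGENCIES:
        raise ValueError(f"unsupported PD_URGENCY: {urgency!r}")
    attributes = {"pdUrgency": urgency}
    priority = environ.get("PD_PRIORITY")
    if priority:
        attributes["pdPriority"] = priority
    return attributes
```

The returned dictionary would then be attached to the job's telemetry as custom attributes, so that the alarm carries them to PD. Rejecting unknown values early avoids an alarm arriving in PD with a prioritisation that no rule matches.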
IM Tech details
● The Custom Metrics DEV alarms also define the prioritisation in the NR alarm.
Concerns and next steps
Concerns and next steps
● Custom metrics bring valuable information, but this has a cost in terms of performance. In my opinion these
metrics are fundamental and must have the highest priority. The inflight queries use 6% of the
time and the completed queries take 2.2% of total time.
● We need to know about every error that happens, and try not to hide errors through retries. The alarm noise could be
reduced by grouping alarms, so as not to be overwhelmed.
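One way to keep every error visible while reducing noise is to group repeated alarms and report a count instead of hiding failures behind retries. A minimal sketch, assuming each alarm is a dictionary with hypothetical "job" and "error" keys:

```python
from collections import Counter


def group_alarms(alarms):
    """Collapse repeated alarms into one entry per (job, error) with a count."""
    counts = Counter((alarm["job"], alarm["error"]) for alarm in alarms)
    return [
        {"job": job, "error": error, "occurrences": n}
        # most_common() puts the noisiest (job, error) pairs first
        for (job, error), n in counts.most_common()
    ]
```

Every failed attempt is still counted, so a job that only succeeds on its third retry shows two occurrences rather than none.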
Next steps
● Add the query trace to other pipelines to get better visibility in AWS Custom Metrics.
● Integrate YUNS as a PD extension.
● Integrate Redshift event notifications into NR and PD.
Q/A
