The proposed monitoring system for analytics databases hinges on two components: a Performance Monitoring Dashboard and an Incident Management Module. The dashboard provides real-time insight into key performance indicators such as query execution time and resource utilization, aiding early detection of potential bottlenecks. The Incident Management Module, integrated with the dashboard, provides the reactive path for anomalies: it auto-logs incidents, categorizes them by severity, and notifies the relevant personnel for prompt resolution. This blend of real-time monitoring and reactive incident management establishes a robust framework for maintaining optimal database performance, minimizing downtime, and keeping the analytics operation running smoothly.
Overview
The monitoring system is built on two components: a monitoring dashboard to analyse performance, and Incident Management to react to incidents.
Monitoring
The Yamato platform dashboard shows the current status and historical data from a performance point of view:
● General overview
● DWH: Infrastructure
● DWH: Workload
The Yamato platform jobs dashboard shows historical data for jobs once they have completed:
● Completed jobs
● Instrumented jobs
● Maintenance jobs
Incident Management
PagerDuty collects the alarms sent by New Relic.
● Yamato INFRA collects alarms from the Yamato Redshift cluster, the WLM Shift Enforcer, …
● Yamato DEV collects alarms built on top of the infrastructure (maintenance, RDL (livesync and hydra), other pipelines, …)
Every alarm has a defined prioritisation (high, low, and info notifications).
To review the incident-management SLAs, use PagerDuty Insights.
The audience for incidents is the EU Data team, which has two schedules to manage incident escalation.
● Level 1 (data team developers) receives low and high incidents.
● Level 2 (data team managers) receives high incidents if they are not acknowledged by Level 1.
Note: the data team manages the infrastructure and development incidents that it owns; SRE focuses on the infrastructure that they own. The split is about the nature of the incidents, so don't confuse the two :)
Note: the purpose of Level 2 is not to resolve an incident.
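The two-level escalation above can be sketched as a small decision function. This is a simplified illustration that ignores escalation timing (in PagerDuty, Level 2 is paged only after an acknowledgement timeout); the schedule names are illustrative, not the real schedule identifiers.

```python
def notify_targets(severity: str, acknowledged_by_l1: bool) -> list:
    """Return which on-call schedule(s) should see an incident.

    Level 1 (data team developers) receives low and high incidents;
    Level 2 (data team managers) only sees high incidents that Level 1
    has not acknowledged. Info notifications page nobody.
    """
    targets = []
    if severity in ("low", "high"):
        targets.append("level1")
    if severity == "high" and not acknowledged_by_l1:
        targets.append("level2")
    return targets
```

For example, an unacknowledged high incident reaches both levels, while a low incident never escalates past Level 1.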
Monitoring Tech details
The dashboard uses CloudWatch metrics converted to New Relic metrics via the metric streams integration. We also add custom metrics from CloudWatch.
The custom metrics use the label field of the Redshift system tables. This approach, also used by the odyn DAGs, lets us identify the job name and the job type.
It is applied to all the instrumented jobs (livesync, hydra, amnesia) and the odyn DAGs; the rest will show as unknown.
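Recovering job dimensions from the query label might look like the sketch below. The exact label layout is not specified above, so the `key=value;key=value` format and the function name are assumptions for illustration; the fallback to `unknown` matches the behaviour described for non-instrumented jobs.

```python
def parse_job_label(label: str) -> dict:
    """Parse a hypothetical 'key=value;key=value' Redshift query label
    into job dimensions. Anything unrecognised stays 'unknown', which
    is how non-instrumented queries show up on the dashboard."""
    dims = {"jobType": "unknown", "jobName": "unknown"}
    for part in label.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            if key in dims and value:
                dims[key] = value
    return dims
```

A labelled query such as `jobType=livesync;jobName=orders_sync` yields both dimensions; an empty or foreign label falls back to `unknown`/`unknown`.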
The custom metrics importer pushes two types of metrics:
● Completed metrics: aggregated at job level over queries that have finished. The dimensions are jobType, jobName, and Username. It currently runs every 30 minutes using the super admin queue.
"JobQueries", "JobCPUUtilization", "JobExecutionTime", "JobQueueTime", "JobBlocksRead", "JobTempBlocksToDisk", "JobSpectrumUsage", "JobNestedLoopJoinRowCount", "JobReturnRowCount", "JobJoinRowCount", "JobMaxSegmentExecutionTime", "JobMaxSegmentIoSkew", "JobConcurrencyScalingTime", "JobResultCachingRatio", "JobWorkmem", "JobSessions"
● Inflight metrics: in-memory (currently running) queries aggregated at job level. The dimensions are jobType, jobName, QueueName, and Username. It runs every minute using the super admin queue.
"InflightJobQueries", "InflightJobCPUUtilization", "InflightJobExecutionTime", "InflightJobQueueTime", "InflightJobBlocksRead", "InflightJobTempBlocksToDisk", "InflightJobSpectrumUsage", "InflightJobNestedLoopJoinRowCount", "InflightJobReturnRowCount", "InflightJobJoinRowCount", "InflightJobMaxSegmentExecutionTime", "InflightJobMaxSegmentIoSkew", "InflightJobConcurrencyScalingTime", "InflightJobWorkmem", "InflightJobSessions"
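A minimal sketch of how the importer might shape one data point for the New Relic Metric API gauge format, with the dimensions listed above as attributes. The function name and the choice of `gauge` are illustrative assumptions; the real importer's internals are not described here.

```python
import time

def build_job_metric(name, value, job_type, job_name, username, queue=None):
    """Build one gauge metric in New Relic Metric API shape.

    Completed metrics carry jobType/jobName/Username; inflight metrics
    additionally carry QueueName, per the dimension lists above.
    """
    attributes = {"jobType": job_type, "jobName": job_name, "Username": username}
    if queue is not None:
        attributes["QueueName"] = queue  # only present for inflight metrics
    return {
        "name": name,
        "type": "gauge",
        "value": value,
        "timestamp": int(time.time()),
        "attributes": attributes,
    }
```

The importer would batch such dicts into the Metric API's `[{"metrics": [...]}]` envelope before POSTing them.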
The New Relic alarms are defined in these three policies:
● Infra alert policy: Redshift, lambda enforcer
● Dev Job alert policy: livesync, hydra, amnesia, maintenance, custom metrics importer
● Custom Metric alert policy: inflight job metrics
Each policy gathers the alert information and passes it to PagerDuty using a New Relic notification template:
● DEV Job notification
● INFRA notification
● DEV Custom Metric Job notification
IM Tech details
The New Relic incident lifecycle auto-resolves an incident when the alarm condition recovers. This is a good approach for infrastructure incidents, but it is not acceptable for developer incidents, which require manual intervention to close.
The orchestration rules in PagerDuty can disable the auto-resolve behaviour:
● The infrastructure service allows an incident to be auto-resolved from NR.
● The developer service disallows auto-resolving an incident from NR.
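The per-service auto-resolve split could be expressed as a small routing check like the one below. The service identifiers are hypothetical; in practice this logic lives in PagerDuty's orchestration rules rather than application code, so this is only a sketch of the decision.

```python
# Hypothetical service names; only infrastructure services honour
# resolve events coming from New Relic.
AUTO_RESOLVE_SERVICES = {"yamato-infra"}

def should_apply_resolve(service: str, event_action: str) -> bool:
    """Decide whether an incoming NR event should close the PagerDuty
    incident: infrastructure incidents auto-resolve, developer
    incidents stay open until someone closes them manually."""
    if event_action != "resolve":
        return True  # trigger/acknowledge events always pass through
    return service in AUTO_RESOLVE_SERVICES
```

A resolve event for `yamato-dev` is dropped, leaving the incident open for manual closure.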
To manage prioritisation, every alarm defines a pdUrgency and/or pdPriority attribute which, through the orchestration rules, sets the urgency and priority in PagerDuty.
The services in PagerDuty are configured based on the severity of the alert.
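The attribute-to-PagerDuty mapping can be sketched as below. The `pdUrgency`/`pdPriority` attribute names come from the text; the default values and the P1..P5 priority scale are assumptions for illustration.

```python
def pd_routing(alarm: dict) -> dict:
    """Map NR alarm attributes to PagerDuty urgency and priority.

    Alarms that define pdUrgency/pdPriority drive the PagerDuty fields
    directly; the fallbacks shown here are illustrative defaults, not
    the documented behaviour.
    """
    return {
        "urgency": alarm.get("pdUrgency", "low"),
        "priority": alarm.get("pdPriority", "P5"),
    }
```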
● The INFRA alarms define the pdUrgency and pdPriority attributes in New Relic.
● The Job DEV alarms for the APM (instrumented jobs) define the prioritisation attributes using an environment variable that can be set at the job-pipeline level.
● The Job DEV alarms for the maintenance jobs define the prioritisation in the NR alarm.
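Reading the prioritisation from a pipeline-level environment variable might look like the sketch below. The variable name `PD_URGENCY` and the default are assumptions; the text only says "an environment variable".

```python
import os

def job_alarm_priority(default: str = "low") -> str:
    """Read the prioritisation for an instrumented job's alarms from an
    environment variable set at the job-pipeline level. PD_URGENCY is a
    hypothetical name used for illustration only."""
    return os.environ.get("PD_URGENCY", default)
```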
● The Custom Metrics DEV alarms also define the prioritisation in the NR alarm.
Concerns and next steps
● Custom metrics bring valuable information, but at a cost in performance. Even so, these metrics are fundamental and should have the highest priority: the inflight queries use about 6% of the total time and the completed queries about 2.2%.
● We need to know about every error that happens; retries should not hide them. Alarm noise can be reduced by grouping alarms, so the team is not overwhelmed.
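The grouping idea could be sketched as follows: collapse repeated alarms for the same job into one group before notifying. The grouping key (policy plus job name) is an illustrative choice, not a documented rule.

```python
from collections import defaultdict

def group_alarms(alarms):
    """Group raw alarm events by (policy, jobName) so that repeated
    failures of the same job collapse into a single notification,
    reducing noise without hiding any individual error."""
    groups = defaultdict(list)
    for alarm in alarms:
        key = (alarm.get("policy", "unknown"), alarm.get("jobName", "unknown"))
        groups[key].append(alarm)
    return groups
```

Each group still carries every underlying event, so no error is lost; only the paging volume shrinks.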
Next steps
● Add query tracing to the other pipelines for better visibility in the AWS custom metrics.
● Integrate YUNS as PD extension.
● Integrate Redshift event notification to NR and PD.