SlideShare a Scribd company logo
1 of 25
Download to read offline
ZABBIX FOR 

HPC MONITORING AND
SUPPORT
Mikhail Serkov
Delivery Manager/HPC Engineer
2016
CONFIDENTIAL 2
What’s next?
HPC monitoring – differences from classic support
model
How do we use Zabbix
AGENDA
Overview of the customer infrastructure and software
stack
High Performance Computing – what is it about#1
#4
#5
#2
#3
CONFIDENTIAL 3
• Scientific research in Pharma area:
Bioinformatics, Computational
Chemistry, Drug Discovery, etc.
• About 10k CPU cores used for a
scientific computation.
• Shared clusters - different workflows
could run simultaneously within the
same cluster.
• About 500 different scientific tools.
• Custom software ( Python, Java, R)
Novartis Institute For Biomedical Research (NIBR)
CONFIDENTIAL 4
• Hundreds or even thousands of computation nodes
• Grid Computing technologies and software ( SGE, UGE, SoGE, PBS, etc)
• Massive parallel computation across the nodes
• Strong requirements for all subsystems on hardware and software level ( storage, network,
power, OS )
• No magic. Linux boxes, shell scripts on a low level ☺
Example of a job submission:
HIGH PERFORMANCE COMPUTING
CONFIDENTIAL 5
OVERVIEW OF THE CUSTOMER INFRASTRUCTURE
250 GPU’s
70TB RAM
35-40KW/Rack
CONFIDENTIAL 6
• 28 CPU cores ( 2 sockets x 14 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 200 GB RAM
• 10 GB Ethernet + InfiniBand interfaces
• 8 GPU cores ( 4 cards x 2 cores each )
• NFS over 10 GB Ehternet
• Lustre over InfiniBand
TYPICAL COMPUTATION NODE CONFIGURATION
CONFIDENTIAL 7
OVERVIEW OF SOFTWARE STACK
• More than 500 of scientific tools
• Bioinformatics, Computation
Chemistry, Xtallography, Molecular
Dynamics, etc
• RHEL6.5
• Univa Grid Engine
• Zabbix 2.4
CONFIDENTIAL 8
• We need information like ‘who, what, when’, not only system metrics.
• Users are allowed to run whatever they want using grid scheduler on the computation
nodes.
• 100% CPU utilization and 100% RAM utilization for node is perfectly fine.
• Node crash – not such a big deal.
• Preventing global issues by using aggregated metrics.
• Metrics not only for monitoring but for a performance analysis.
• Users are having access to the monitoring system ( but restricted ).
HPC MONITORING DIFFERENCES
CONFIDENTIAL 9
• Able to monitor of a huge systems with a lot of metrics
• Flexible
• Out of the box
• Ability to aggregate metrics
• API for a data extraction
• GUI convenient for both support team and scientists
• Autodiscovery
• New nodes automatic configuration
WHY ZABBIX?
CONFIDENTIAL 10
ZABBIX CONFIGURATION
Server configuration:
• 20 CPU cores ( 2 sockets x 10 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 120 GB RAM
Number of hosts: 601
Number of items: ~200k
Number of triggers: ~37k
DB Size: 187GB
CONFIDENTIAL 11
WHAT DO WE MONITOR
Local metrics ( node level ) Global metrics ( cluster level )
All default Linux checks (LA, CPU utilization, RAM,
swap, etc) - agent
Meta CPU utilization – aggregation of CPU utilization
of HPC nodes.
Every single GPU core ( Temperature, Utilization if
possible) - agent
NFS global transmit/retransmit - aggregation of
nodes values
Every single CPU core ( Utilization, Temperature) -
agent
Grid specific – used/active slots, running jobs,
pending jobs, top users - external scripts
NFS shares availability / utilization / mount details -
agent
CPU/Memory oversubscription - aggregation of
nodes values
Slots / RAM reserved - external scripts Overloaded nodes - aggregation of HPC values
HPC jobs - external scripts Pending time - external scripts
... ....
CONFIDENTIAL 12
HPC specific examples
1) Expected utilization VS Real one
Every job has a resource request for number of CPUs, RAM, etc. In every moment we can compare real
utilization with an expected one. If they are not close, we need to investigate if someone oversubscribing
resources or overload nodes.
Solution: Zabbix not only checks current system metrics, but also keeps an expected values. If they
are too different we receive warning.
2) Users on a computation node
Users are not restricted to SSH to any node ( debugging, tracing job in real time, interactive jobs,
etc). However we should check if user has job on the node he is logged into.
Solution: We have a trigger that notify us if we have anyone logged on the node with no job running.
Additionally we store a list of logged in users for any single moment.
CONFIDENTIAL 13
HPC specific examples
Pending time probes
It is really hard to predict the pending time for any particular job in the pending list, as they all have different resource
requests, and runtimes. It is not a FIFO and the pending time is always related to resources user wants to have.
Solution: Zabbix runs ‘pending probes’ ( empty jobs) and checks how long does it take. This is a good indicator for
queue state at the moment.
CONFIDENTIAL 14
WHAT DO WE MONITOR: GLOBAL METRICS
Global cluster utilization
CONFIDENTIAL 15
WHAT DO WE MONITOR: GLOBAL METRICS
RAM oversubscription
CONFIDENTIAL 16
WHAT DO WE MONITOR: GLOBAL METRICS
CPU time oversubscription
CONFIDENTIAL 17
WHAT DO WE MONITOR: GLOBAL METRICS
Meta CPU utilization
CONFIDENTIAL 18
WHAT DO WE MONITOR: GLOBAL METRICS
Aggregated cluster status
CONFIDENTIAL 19
WHAT DO WE MONITOR: GLOBAL METRICS
Storage operational metrics
CONFIDENTIAL 20
WHAT DO WE MONITOR: LOCAL METRICS
CONFIDENTIAL 21
WHAT DO WE MONITOR: LOCAL METRICS
CONFIDENTIAL 22
USER ACCESS
We want to provide a limited amount of information to users. They don’t need any info about triggers and issues, but only metrics. We
have patched Zabbix to remove all unnecessary data for guest access.
After
Before
CONFIDENTIAL 23
Benefits
• Better understanding of a global issues on the cluster an reasons of why have they happened.
• Great performance indicators for other infrastructure teams ( especially Storage team )
• Performance tuning of a scientific workflows. Jobs profiling. In some cases information we cat get from
Zabbix is helping us to significantly improve performance of jobs.
• Proactive monitoring. With Zabbix it’s easier to understand if something is not right on the cluster or
with some job. In most cases we are able to prevent global cluster issues, or at least minimize an
impact.
• One monitoring system for clusters and HPC infrastructure.
• “All in one”. Lower efforts on support/maintain monitoring system(s).
CONFIDENTIAL 24
• Tight integration with Grid HPC software.
• Data analysis using external tools, but with Zabbix data source.
• Create a set of CLI utilities for getting Zabbix statistics in ‘human-readable’ format.
• Automation of jobs profiling using Zabbix API.
WHAT’S NEXT?
CONFIDENTIAL 25
Questions?

More Related Content

What's hot

What's hot (20)

Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Introduction to Ansible
Introduction to AnsibleIntroduction to Ansible
Introduction to Ansible
 
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioningLeveraging Docker for Hadoop build automation and Big Data stack provisioning
Leveraging Docker for Hadoop build automation and Big Data stack provisioning
 
Kubernetes 101 - A Cluster Operating System
Kubernetes 101 - A Cluster Operating SystemKubernetes 101 - A Cluster Operating System
Kubernetes 101 - A Cluster Operating System
 
Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
My Favorite PostgreSQL Books
My Favorite PostgreSQL BooksMy Favorite PostgreSQL Books
My Favorite PostgreSQL Books
 
Introduction to Nginx
Introduction to NginxIntroduction to Nginx
Introduction to Nginx
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Introduction to Kubernetes Workshop
Introduction to Kubernetes WorkshopIntroduction to Kubernetes Workshop
Introduction to Kubernetes Workshop
 
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark OperatorApache Spark Streaming in K8s with ArgoCD & Spark Operator
Apache Spark Streaming in K8s with ArgoCD & Spark Operator
 
A glimpse of cassandra 4.0 features netflix
A glimpse of cassandra 4.0 features   netflixA glimpse of cassandra 4.0 features   netflix
A glimpse of cassandra 4.0 features netflix
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Deterministic capacity planning for OpenStack as elastic cloud infrastructure
Deterministic capacity planning for OpenStack as elastic cloud infrastructureDeterministic capacity planning for OpenStack as elastic cloud infrastructure
Deterministic capacity planning for OpenStack as elastic cloud infrastructure
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
Kubernetes Problem-Solving
Kubernetes Problem-SolvingKubernetes Problem-Solving
Kubernetes Problem-Solving
 

Viewers also liked

General Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdfGeneral Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdf
OpenStack Foundation
 

Viewers also liked (20)

Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
Dimitri Bellini and Pietro Antonacci - Manage Zabbix Proxies in Remote Networ...
 
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
Lukáš Malý - Log management ELISA controlled by Zabbix | ZabConf2016
 
Rihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyRihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case Study
 
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
Zabbix Conference LatAm 2016 - Jorge Pretel - Low Level Discovery for ODBC an...
 
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
Raymond Kuiper - Zen and The Art of Zabbix Template Design | ZabConf2016
 
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
Zabbix Conference LatAm 2016 - Daniel Nasiloski - Extending Zabbix - Interact...
 
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with ZabbixZabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
Zabbix Conference LatAm 2016 - Jessian Ferreira - Wireless with Zabbix
 
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for EveryoneAlexei Vladishev - Zabbix - Monitoring Solution for Everyone
Alexei Vladishev - Zabbix - Monitoring Solution for Everyone
 
OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2OpenStack Marketing Meeting Oct 2
OpenStack Marketing Meeting Oct 2
 
Openstack高度自动化持续交付
Openstack高度自动化持续交付Openstack高度自动化持续交付
Openstack高度自动化持续交付
 
General Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdfGeneral Bare-metal Provisioning Framework.pdf
General Bare-metal Provisioning Framework.pdf
 
OpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use CasesOpenStack Day CEE 2015: Real-World Use Cases
OpenStack Day CEE 2015: Real-World Use Cases
 
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!Feedbackstr - Verbessern Sie Ihr Geschäft  durch das Feedback Ihrer Kunden!
Feedbackstr - Verbessern Sie Ihr Geschäft durch das Feedback Ihrer Kunden!
 
NICE Fizzback: Echtzeit-Kundenfeedback
NICE Fizzback: Echtzeit-KundenfeedbackNICE Fizzback: Echtzeit-Kundenfeedback
NICE Fizzback: Echtzeit-Kundenfeedback
 
Social Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes KundenfeedbackSocial Media: 4 Tipps für ein gutes Kundenfeedback
Social Media: 4 Tipps für ein gutes Kundenfeedback
 
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
Oleg Ivanivskyi - Lessons Learned While Being On-Site | ZabConf2016
 
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkVladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
 
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning TalkInaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
Inaba Kazuhiko - Ahiruyaki Zabbix in Japan Part 2 | ZabConf2016 Lightning Talk
 
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning TalkRafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
Rafael Martinez Guerrero Zabbix CLI | ZabConf2016 Lightning Talk
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
 

Similar to Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016

Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
Claudiu Barbura
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
Mahmoud Hatem
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
DataWorks Summit
 

Similar to Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016 (20)

Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Monitoring federation open stack infrastructure
Monitoring federation open stack infrastructureMonitoring federation open stack infrastructure
Monitoring federation open stack infrastructure
 
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
Intro to open source telemetry linux con 2016
Intro to open source telemetry   linux con 2016Intro to open source telemetry   linux con 2016
Intro to open source telemetry linux con 2016
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)The state of Hive and Spark in the Cloud (July 2017)
The state of Hive and Spark in the Cloud (July 2017)
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning FPGA Hardware Accelerator for Machine Learning
FPGA Hardware Accelerator for Machine Learning
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017OCP Telco Engineering Workshop at BCE2017
OCP Telco Engineering Workshop at BCE2017
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storageWebinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
Webinar: OpenEBS - Still Free and now FASTEST Kubernetes storage
 

More from Zabbix

More from Zabbix (18)

Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil CommunityZabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
 
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and ZabbixZabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
 
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
 
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
 
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
 
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMPZabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
 
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
 
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
 
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
 
Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
 
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
 
Rihards Olups - Zabbix log management
Rihards Olups - Zabbix log managementRihards Olups - Zabbix log management
Rihards Olups - Zabbix log management
 
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and ZabbixZabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016

  • 1. ZABBIX FOR 
 HPC MONITORING AND SUPPORT Mikhail Serkov Delivery Manager/HPC Engineer 2016
  • 2. CONFIDENTIAL 2 What’s next? HPC monitoring – differences from classic support model How do we use Zabbix AGENDA Overview of the customer infrastructure and software stack High Performance Computing – what is it about#1 #4 #5 #2 #3
  • 3. CONFIDENTIAL 3 • Scientific research in Pharma area: Bioinformatics, Computational Chemistry, Drug Discovery, etc. • About 10k CPU cores used for a scientific computation. • Shared clusters - different workflows could run simultaneously within the same cluster. • About 500 different scientific tools. • Custom software ( Python, Java, R) Novartis Institute For Biomedical Research (NIBR)
  • 4. CONFIDENTIAL 4 • Hundreds or even thousands of computation nodes • Grid Computing technologies and software ( SGE, UGE, SoGE, PBS, etc) • Massive parallel computation across the nodes • Strong requirements for all subsystems on hardware and software level ( storage, network, power, OS ) • No magic. Linux boxes, shell scripts on a low level ☺ Example of a job submission: HIGH PERFORMANCE COMPUTING
  • 5. CONFIDENTIAL 5 OVERVIEW OF THE CUSTOMER INFRASTRUCTURE 250 GPU’s 70TB RAM 35-40KW/Rack
  • 6. CONFIDENTIAL 6 • 28 CPU cores ( 2 sockets x 14 cores each ) − Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 200 GB RAM • 10 GB Ethernet + InfiniBand interfaces • 8 GPU cores ( 4 cards x 2 cores each ) • NFS over 10 GB Ehternet • Lustre over InfiniBand TYPICAL COMPUTATION NODE CONFIGURATION
  • 7. CONFIDENTIAL 7 OVERVIEW OF SOFTWARE STACK • More than 500 of scientific tools • Bioinformatics, Computation Chemistry, Xtallography, Molecular Dynamics, etc • RHEL6.5 • Univa Grid Engine • Zabbix 2.4
  • 8. CONFIDENTIAL 8 • We need information like ‘who, what, when’, not only system metrics. • Users are allowed to run whatever they want using grid scheduler on the computation nodes. • 100% CPU utilization and 100% RAM utilization for node is perfectly fine. • Node crash – not such a big deal. • Preventing global issues by using aggregated metrics. • Metrics not only for monitoring but for a performance analysis. • Users are having access to the monitoring system ( but restricted ). HPC MONITORING DIFFERENCES
  • 9. CONFIDENTIAL 9 • Able to monitor of a huge systems with a lot of metrics • Flexible • Out of the box • Ability to aggregate metrics • API for a data extraction • GUI convenient for both support team and scientists • Autodiscovery • New nodes automatic configuration WHY ZABBIX?
  • 10. CONFIDENTIAL 10 ZABBIX CONFIGURATION Server configuration: • 20 CPU cores ( 2 sockets x 10 cores each ) − Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 120 GB RAM Number of hosts: 601 Number of items: ~200k Number of triggers: ~37k DB Size: 187GB
  • 11. CONFIDENTIAL 11 WHAT DO WE MONITOR Local metrics ( node level ) Global metrics ( cluster level ) All default Linux checks (LA, CPU utilization, RAM, swap, etc) - agent Meta CPU utilization – aggregation of CPU utilization of HPC nodes. Every single GPU core ( Temperature, Utilization if possible) - agent NFS global transmit/retransmit - aggregation of nodes values Every single CPU core ( Utilization, Temperature) - agent Grid specific – used/active slots, running jobs, pending jobs, top users - external scripts NFS shares availability / utilization / mount details - agent CPU/Memory oversubscription - aggregation of nodes values Slots / RAM reserved - external scripts Overloaded nodes - aggregation of HPC values HPC jobs - external scripts Pending time - external scripts ... ....
  • 12. CONFIDENTIAL 12 HPC specific examples 1) Expected utilization VS Real one Every job has a resource request for number of CPUs, RAM, etc. In every moment we can compare real utilization with an expected one. If they are not close, we need to investigate if someone oversubscribing resources or overload nodes. Solution: Zabbix not only checks current system metrics, but also keeps an expected values. If they are too different we receive warning. 2) Users on a computation node Users are not restricted to SSH to any node ( debugging, tracing job in real time, interactive jobs, etc). However we should check if user has job on the node he is logged into. Solution: We have a trigger that notify us if we have anyone logged on the node with no job running. Additionally we store a list of logged in users for any single moment.
  • 13. CONFIDENTIAL 13 HPC specific examples Pending time probes It is really hard to predict the pending time for any particular job in the pending list, as they all have different resource requests, and runtimes. It is not a FIFO and the pending time is always related to resources user wants to have. Solution: Zabbix runs ‘pending probes’ ( empty jobs) and checks how long does it take. This is a good indicator for queue state at the moment.
  • 14. CONFIDENTIAL 14 WHAT DO WE MONITOR: GLOBAL METRICS Global cluster utilization
  • 15. CONFIDENTIAL 15 WHAT DO WE MONITOR: GLOBAL METRICS RAM oversubscription
  • 16. CONFIDENTIAL 16 WHAT DO WE MONITOR: GLOBAL METRICS CPU time oversubscription
  • 17. CONFIDENTIAL 17 WHAT DO WE MONITOR: GLOBAL METRICS Meta CPU utilization
  • 18. CONFIDENTIAL 18 WHAT DO WE MONITOR: GLOBAL METRICS Aggregated cluster status
  • 19. CONFIDENTIAL 19 WHAT DO WE MONITOR: GLOBAL METRICS Storage operational metrics
  • 20. CONFIDENTIAL 20 WHAT DO WE MONITOR: LOCAL METRICS
  • 21. CONFIDENTIAL 21 WHAT DO WE MONITOR: LOCAL METRICS
  • 22. CONFIDENTIAL 22 USER ACCESS We want to provide a limited amount of information to users. They don’t need any info about triggers and issues, but only metrics. We have patched Zabbix to remove all unnecessary data for guest access. After Before
  • 23. CONFIDENTIAL 23 Benefits • Better understanding of a global issues on the cluster an reasons of why have they happened. • Great performance indicators for other infrastructure teams ( especially Storage team ) • Performance tuning of a scientific workflows. Jobs profiling. In some cases information we cat get from Zabbix is helping us to significantly improve performance of jobs. • Proactive monitoring. With Zabbix it’s easier to understand if something is not right on the cluster or with some job. In most cases we are able to prevent global cluster issues, or at least minimize an impact. • One monitoring system for clusters and HPC infrastructure. • “All in one”. Lower efforts on support/maintain monitoring system(s).
  • 24. CONFIDENTIAL 24 • Tight integration with Grid HPC software. • Data analysis using external tools, but with Zabbix data source. • Create a set of CLI utilities for getting Zabbix statistics in ‘human-readable’ format. • Automation of jobs profiling using Zabbix API. WHAT’S NEXT?