Event Management and Monitoring Framework for HPC Environments using ServiceNow and Prometheus

Proceedings of the 12th International Conference on Management of Digital EcoSystems. Pub Date: 2020-11-02. DOI: 10.1145/3415958.3433046

Nitin Sukhija, Elizabeth Bautista, Owen James, Daniel Gens, Siqi Deng, Yu Lam, Tony Quan, Basil Lalli



Abstract

The challenge of monitoring and event response management at a high performance computing (HPC) facility grows significantly as the facility employs and orchestrates more complex and heterogeneous systems and infrastructure. As the number of computational components in the HPC facility increases, the computational staff experience a rise in alert fatigue due to false alarms and noise from similar events generated by monitoring tools. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) has begun to address the issues of alert duplication and alert remediation. However, more automation and integration are needed for collecting, aggregating, correlating, analyzing, managing, and visualizing the volume of events that emerging hybrid computing infrastructures will generate. In this paper, we present an event management and monitoring framework that addresses the operational needs of the future pre-exascale systems at NERSC. The framework integrates the Operations Monitoring and Notification Infrastructure (OMNI) at NERSC with the Prometheus, Grafana, and ServiceNow platforms to help identify, diagnose, and resolve incidents in real time, and to support more thorough post-incident reviews through intuitive dashboards that provide a single-pane-of-glass console for efficient operations management and real-time proactive monitoring.
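
To make the alert duplication problem mentioned in the abstract concrete, here is a minimal sketch of one common way to collapse similar alerts before they reach an incident queue. It is not the authors' implementation and does not call any OMNI, Prometheus, or ServiceNow API. The `Incident` class, the `fingerprint` and `deduplicate` helpers, and the `cluster` and `node` label names are all hypothetical and chosen for illustration; only `alertname` is a label Prometheus itself attaches to alerts.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical stand-in for a ticket in an incident-management system."""
    fingerprint: str
    labels: dict
    count: int = 1  # how many raw alerts were folded into this incident


def fingerprint(labels, keys=("alertname", "cluster", "node")):
    """Hash only the labels assumed to identify the underlying problem."""
    identity = "|".join(f"{k}={labels.get(k, '')}" for k in keys)
    return hashlib.sha256(identity.encode()).hexdigest()[:16]


def deduplicate(alerts):
    """Fold alerts that share a fingerprint into a single incident."""
    incidents = {}
    for labels in alerts:
        fp = fingerprint(labels)
        if fp in incidents:
            incidents[fp].count += 1  # repeat of a known problem
        else:
            incidents[fp] = Incident(fp, dict(labels))
    return list(incidents.values())


# Illustrative input: two alerts for the same node, one for another node.
alerts = [
    {"alertname": "NodeDown", "cluster": "c1", "node": "n001"},
    {"alertname": "NodeDown", "cluster": "c1", "node": "n001"},
    {"alertname": "NodeDown", "cluster": "c1", "node": "n002"},
]
incidents = deduplicate(alerts)
for inc in incidents:
    print(inc.labels["node"], inc.count)
```

Which labels count as the identity of a problem is the key design decision in a scheme like this: too few and unrelated failures merge into one incident, too many and duplicates slip through.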
