Event Management and Monitoring Framework for HPC Environments using ServiceNow and Prometheus

Nitin Sukhija, Elizabeth Bautista, Owen James, Daniel Gens, Siqi Deng, Yu Lam, Tony Quan, Basil Lalli
Published in: Proceedings of the 12th International Conference on Management of Digital EcoSystems, 2020-11-02
DOI: 10.1145/3415958.3433046 (https://doi.org/10.1145/3415958.3433046)
Cited by: 6

Abstract

The challenge of monitoring and event response management at a high performance computing (HPC) facility grows significantly as the facility employs and orchestrates more complex and heterogeneous systems and infrastructure. As the number of computational components in the facility increases, operations staff experience rising alert fatigue caused by false alarms and by noise from similar events generated by monitoring tools. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) has begun to address the issues of alert duplication and alert remediation. However, more automation and integration are needed to collect, aggregate, correlate, analyze, manage, and visualize the volume of events that emerging hybrid computing infrastructures will generate. In this paper, we present an event management and monitoring framework that addresses the operational needs of the future pre-exascale systems at NERSC. The framework integrates the Operations Monitoring and Notification Infrastructure (OMNI) at NERSC with the Prometheus, Grafana, and ServiceNow platforms to help identify, diagnose, and resolve incidents in real time, and to support more thorough post-incident reviews through intuitive dashboards that provide a single-pane-of-glass console for efficient operations management and real-time proactive monitoring.
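The abstract does not give implementation details, so the following is only a minimal sketch of the alert-deduplication idea it mentions, not the authors' method. It assumes alerts arrive shaped like the entries in a Prometheus Alertmanager webhook payload (`status`, `labels`, `annotations`) and maps them to a dictionary loosely modeled on a ServiceNow event record. The function names, the severity mapping, and the `count` field are hypothetical and chosen for illustration.

```python
import hashlib
import json

# Hypothetical mapping from a Prometheus "severity" label to a numeric
# event severity (lower is more severe; 0 is used here for "cleared").
SEVERITY_MAP = {"critical": 1, "major": 2, "minor": 3, "warning": 4, "info": 5}


def message_key(labels):
    """Return a stable key so identical label sets collapse to one event."""
    canonical = json.dumps(labels, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def to_event(alert):
    """Translate one Alertmanager-style alert into an event-like dict."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    resolved = alert.get("status") == "resolved"
    return {
        "source": "prometheus",
        "node": labels.get("instance", "unknown"),
        "type": labels.get("alertname", "unknown"),
        # Unknown or missing severity labels fall back to "warning" (4).
        "severity": 0 if resolved else SEVERITY_MAP.get(labels.get("severity"), 4),
        "description": annotations.get("summary", ""),
        "message_key": message_key(labels),
    }


def deduplicate(alerts):
    """Collapse alerts that share a label set, keeping the latest state
    and a running count of how many times the alert was seen."""
    events = {}
    for alert in alerts:
        event = to_event(alert)
        key = event["message_key"]
        event["count"] = events[key]["count"] + 1 if key in events else 1
        events[key] = event
    return list(events.values())
```

Calling `deduplicate(payload["alerts"])` on a decoded webhook payload would return one event per distinct label set, so repeated firings of the same alert reach the ticketing side as a single record with an incremented count. Keying on the full label set is a deliberately simple choice; a production setup would more likely rely on the grouping and inhibition rules of the alerting pipeline itself.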