Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics

J. Thaler, Woong Shin, S. Roberts, James H. Rogers, Todd J. Rosedahl
Published in: 2020 IEEE High Performance Extreme Computing Conference (HPEC)
Publication date: 2020-09-22
DOI: 10.1109/HPEC43674.2020.9286239 (https://doi.org/10.1109/HPEC43674.2020.9286239)
Citations: 1

Abstract

The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering and reacting to a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves system uptime. When performance is degraded or nodes go down, RAS (reliability, availability, and serviceability) policies can direct actions such as job quiescence or migration. Additionally, power consumption, thermal information, and utilization metrics can be used to improve cluster energy and cooling efficiency and to optimize job placement. This paper describes a highly scalable telemetry architecture that supports event aggregation, the application of RAS policies, and feedback to the cluster control system. The architecture advances existing approaches by including both programmable policies, which are applied as events stream through the hierarchical network to persistent storage, and treatment of sensor telemetry in an extensible framework. This implementation has proven robust and is in use in both cloud and HPC environments, including the Summit system of 4,608 nodes at Oak Ridge National Laboratory [5].
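The idea of applying programmable policies while events stream through a hierarchy toward persistent storage can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` and `Aggregator` classes, the two policy functions, the node names, and the 90.0 degree threshold are all hypothetical, chosen only to show the shape of the technique under the assumption that each level of the hierarchy runs its own policies and then forwards the event upward.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Event:
    node: str           # hypothetical source node name
    kind: str           # e.g. "temperature" or "node_down"
    value: float = 0.0  # sensor reading, if any


# A policy inspects one event and optionally returns an action to take.
Policy = Callable[[Event], Optional[str]]


@dataclass
class Aggregator:
    """One level of a hypothetical aggregation hierarchy."""
    name: str
    policies: List[Policy] = field(default_factory=list)
    parent: Optional["Aggregator"] = None
    store: List[Event] = field(default_factory=list)  # filled at the root only
    actions: List[str] = field(default_factory=list)

    def submit(self, event: Event) -> None:
        # Apply this level's policies while the event is in flight.
        for policy in self.policies:
            action = policy(event)
            if action is not None:
                self.actions.append(action)
        # Forward upward; the root stands in for persistent storage.
        if self.parent is not None:
            self.parent.submit(event)
        else:
            self.store.append(event)


def overheat_policy(event: Event) -> Optional[str]:
    # Assumed threshold, for illustration only.
    if event.kind == "temperature" and event.value > 90.0:
        return f"quiesce jobs on {event.node}"
    return None


def outage_policy(event: Event) -> Optional[str]:
    if event.kind == "node_down":
        return f"migrate jobs off {event.node}"
    return None


root = Aggregator("root", policies=[outage_policy])
rack = Aggregator("rack0", policies=[overheat_policy], parent=root)

for ev in [
    Event("n1", "temperature", 72.0),
    Event("n2", "temperature", 95.5),
    Event("n3", "node_down"),
]:
    rack.submit(ev)

print(rack.actions)
print(root.actions)
print(len(root.store))
```

In this sketch the rack-level aggregator reacts to the thermal event locally, the root reacts to the outage, and every event still reaches the root's store. A real deployment would replace the in-memory list with a durable backend and route the actions to the cluster control system.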