Cluster system management

IF 1.3 4区 计算机科学 Q1 Computer Science
N. Besaw;L. Scheidenbach;J. Dunham;S. Kaur;A. Ohmacht;F. Pizzano;Y. Park
{"title":"Cluster system management","authors":"N. Besaw;L. Scheidenbach;J. Dunham;S. Kaur;A. Ohmacht;F. Pizzano;Y. Park","doi":"10.1147/JRD.2020.2967309","DOIUrl":null,"url":null,"abstract":"Cluster system management (CSM) was co-designed with the Department of Energy Labs to provide the support necessary to effectively manage the Summit and Sierra supercomputers. The CSM system administration tools provide a unified view of a large-scale cluster and the ability to examine and understand data from multiple sources. CSM consists of five components: 1) application programming interfaces (APIs) and infrastructure; 2) Big Data Store; 3) support for reliability, availability, and serviceability (RAS); 4) Diagnostic and Health Check; and 5) support for job management. APIs and infrastructure provide lightweight daemons for compute nodes, hardware and software inventory collection, job accounting, and RAS. Logs, environmental data, and performance data are collected in the Big Data Store for analysis. RAS events can trigger corrective actions by CSM. Diagnostic and Health Check are provided through a diagnostic framework and test results collection. To support job management, CSM coordinates with the Job Step Manager to provide an overlay network of JSM daemons. CSM is an open source and available at \n<uri>https://github.com/IBM/CAST</uri>\n. Documentation can be found at \n<uri>https://cast.readthedocs.io</uri>\n.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":null,"pages":null},"PeriodicalIF":1.3000,"publicationDate":"2020-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2020.2967309","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IBM Journal of Research and Development","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/8961133/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 3

Abstract

Cluster system management (CSM) was co-designed with the Department of Energy Labs to provide the support necessary to effectively manage the Summit and Sierra supercomputers. The CSM system administration tools provide a unified view of a large-scale cluster and the ability to examine and understand data from multiple sources. CSM consists of five components: 1) application programming interfaces (APIs) and infrastructure; 2) Big Data Store; 3) support for reliability, availability, and serviceability (RAS); 4) Diagnostic and Health Check; and 5) support for job management. APIs and infrastructure provide lightweight daemons for compute nodes, hardware and software inventory collection, job accounting, and RAS. Logs, environmental data, and performance data are collected in the Big Data Store for analysis. RAS events can trigger corrective actions by CSM. Diagnostic and Health Check are provided through a diagnostic framework and test results collection. To support job management, CSM coordinates with the Job Step Manager to provide an overlay network of JSM daemons. CSM is an open source and available at https://github.com/IBM/CAST . Documentation can be found at https://cast.readthedocs.io .
集群系统管理
集群系统管理(CSM)与能源部实验室共同设计,为有效管理Summit和Sierra超级计算机提供必要的支持。CSM系统管理工具提供大规模集群的统一视图,以及检查和理解来自多个来源的数据的能力。CSM由五个部分组成:1)应用程序编程接口(api)和基础设施;2)大数据存储;3)对可靠性、可用性和可服务性(RAS)的支持;4)诊断和健康检查;5)支持作业管理。api和基础设施为计算节点、硬件和软件库存收集、作业记帐和RAS提供轻量级守护进程。日志数据、环境数据和性能数据被收集到大数据存储中进行分析。RAS事件可以触发CSM的纠正措施。诊断和运行状况检查通过诊断框架和测试结果集合提供。为了支持作业管理,CSM与作业步骤管理器协调,以提供JSM守护进程的覆盖网络。CSM是开源的,可以在https://github.com/IBM/CAST上获得。文档可以在https://cast.readthedocs.io上找到。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IBM Journal of Research and Development
IBM Journal of Research and Development 工程技术-计算机:硬件
自引率
0.00%
发文量
0
审稿时长
6-12 weeks
期刊介绍: The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals. Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信