Cluster system management
N. Besaw; L. Scheidenbach; J. Dunham; S. Kaur; A. Ohmacht; F. Pizzano; Y. Park
IBM Journal of Research and Development. Published 2020-01-16. DOI: 10.1147/JRD.2020.2967309
https://ieeexplore.ieee.org/document/8961133/
Citations: 3
Abstract
Cluster system management (CSM) was co-designed with the Department of Energy Labs to provide the support necessary to effectively manage the Summit and Sierra supercomputers. The CSM system administration tools provide a unified view of a large-scale cluster and the ability to examine and understand data from multiple sources. CSM consists of five components: 1) application programming interfaces (APIs) and infrastructure; 2) Big Data Store; 3) support for reliability, availability, and serviceability (RAS); 4) Diagnostic and Health Check; and 5) support for job management. APIs and infrastructure provide lightweight daemons for compute nodes, hardware and software inventory collection, job accounting, and RAS. Logs, environmental data, and performance data are collected in the Big Data Store for analysis. RAS events can trigger corrective actions by CSM. Diagnostic and Health Check are provided through a diagnostic framework and test results collection. To support job management, CSM coordinates with the Job Step Manager (JSM) to provide an overlay network of JSM daemons. CSM is open source and available at https://github.com/IBM/CAST. Documentation can be found at https://cast.readthedocs.io.
About the journal:
The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, that features the work of authors in the science, technology, and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals.
Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.