Resolving frontier problems of mastering large-scale supercomputer complexes
D. Nikitenko, V. Voevodin, S. Zhumatiy
Proceedings of the ACM International Conference on Computing Frontiers, 2016-05-16
https://doi.org/10.1145/2903150.2903481
Managing and administering large-scale HPC centers is a complicated problem. Using a number of independent tools to resolve its seemingly independent subproblems can become a bottleneck as the scale of systems, the number of hardware and software components, the variety of user applications and license types, and the number of users and workgroups grow rapidly. The developed tool is designed to help resolve routine problems in managing and administering any supercomputer center, from a stand-alone system up to top-rank supercomputer centers that include a number of totally different HPC systems. The toolkit implements a flexibly configurable variety of essential tools in a single interface. It also provides automation for typical multi-step administration and management procedures. Another important design and implementation feature is that the toolkit can be installed and used without any significant changes to existing administration tools and system software. The developed tool is not integrated with the system software of the target machines; it runs on a remote server and executes scripts on the HPC systems via SSH as a dedicated user with limited access permissions to perform certain actions. This greatly reduces the possibility of security issues and takes care of many fault tolerance issues that are among the key challenges on the road to Exascale. At the same time, it lets administrators perform any operation with the tools best suited to the situation, whether ours or any other available tool. Evaluation of the developed system proved its practicality in an HPC center with some Petaflop-level supercomputers and thousands of active researchers from a diversity of institutions working within several hundred applied projects.
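The remote-execution design described in the abstract (a management server that runs a limited set of scripts on HPC systems over SSH as a dedicated, restricted user) can be illustrated with a minimal sketch. This is not the authors' implementation: the action names, script paths, the `hpcadmin` account, and the function names below are all hypothetical, chosen only to show the idea of an allow-list plus per-call failure isolation.

```python
import shlex
import subprocess

# Hypothetical allow-list: action name -> script path on the HPC system.
# These names and paths are assumptions for illustration only.
ALLOWED_ACTIONS = {
    "list_queue": "/opt/hpc-admin/bin/list_queue.sh",
    "block_user": "/opt/hpc-admin/bin/block_user.sh",
}


def build_ssh_command(host, action, args=(), user="hpcadmin"):
    """Return the argv that runs one allow-listed script on a remote host."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not permitted: {action}")
    # Quote every argument so it cannot be interpreted by the remote shell.
    quoted = [shlex.quote(str(a)) for a in args]
    remote = " ".join([ALLOWED_ACTIONS[action], *quoted])
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote]


def run_action(host, action, args=(), timeout=30):
    """Run one action; a timeout or launch failure affects only this call."""
    try:
        proc = subprocess.run(
            build_ssh_command(host, action, args),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return None, str(exc)
    return proc.returncode, proc.stdout
```

Because the management server only issues SSH commands, an unreachable or failed HPC system shows up as a single failed call rather than taking the toolkit down, and the restricted account bounds what any one call can do. In a real deployment the limits would also be enforced on the target side (for example through the account's permissions), not only in the caller.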