CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters

Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M. Mustafa Rafique
{"title":"CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters","authors":"Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M. Mustafa Rafique","doi":"10.1109/DS-RT50469.2020.9213578","DOIUrl":null,"url":null,"abstract":"The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components in this ecosystem in a variety of scenarios, such as, steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by the supercomputing infrastructures needs to be complemented with support to schedule opportunistic on-demand analytics jobs, leading to the problem of efficient preemption of batch jobs with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-offs arising between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and loss of progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the aforementioned trade-off and take decisions in near real-time. 
Compared with other state-of-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution, while achieving high performance and scalability.","PeriodicalId":149260,"journal":{"name":"2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)","volume":"190 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DS-RT50469.2020.9213578","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components of this ecosystem in a variety of scenarios, such as steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by supercomputing infrastructures needs to be complemented with support for scheduling opportunistic on-demand analytics jobs, leading to the problem of efficiently preempting batch jobs with minimal loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-off between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and losing progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the aforementioned trade-off and make decisions in near real-time. Compared with other state-of-the-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution while achieving high performance and scalability.
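The core decision the abstract describes — which batch jobs to preempt so an on-demand job can start, at minimum loss of progress — can be illustrated with a knapsack-style dynamic program. The sketch below is an assumption for illustration only (the function name, the `(nodes, progress_loss)` cost model, and the DP formulation are not taken from the paper; CoSim's actual algorithm may differ): it selects a subset of running batch jobs whose freed nodes cover the on-demand job's requirement while minimizing total lost progress.

```python
def min_cost_preemption(batch_jobs, nodes_needed):
    """Illustrative DP, not the paper's algorithm.

    batch_jobs: list of (nodes, progress_loss) tuples, one per running
        batch job; preempting a job frees `nodes` and wastes
        `progress_loss` units of work.
    nodes_needed: nodes the on-demand job requires.

    Returns the minimum total progress loss over subsets of batch jobs
    that free at least `nodes_needed` nodes (inf if impossible).
    """
    INF = float("inf")
    # best[k] = cheapest progress loss that frees at least k nodes
    best = [INF] * (nodes_needed + 1)
    best[0] = 0.0
    for nodes, loss in batch_jobs:
        # Iterate k descending so each job is preempted at most once
        for k in range(nodes_needed, -1, -1):
            if best[k] < INF:
                freed = min(nodes_needed, k + nodes)  # cap at requirement
                if best[k] + loss < best[freed]:
                    best[freed] = best[k] + loss
    return best[nodes_needed]
```

For example, with batch jobs `[(4, 10.0), (2, 3.0), (2, 4.0)]` and an on-demand job needing 4 nodes, preempting the two small jobs (total loss 7.0) beats preempting the large one (loss 10.0). A scheduler could compare this minimum preemption cost against the latency cost of delaying the on-demand job to navigate the trade-off the paper analyzes.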