ElastiSim: A Batch-System Simulator for Malleable Workloads

Taylan Özden, Tim Beringer, Arya Mazaheri, H. M. Fard, F. Wolf
Proceedings of the 51st International Conference on Parallel Processing, published 2022-08-29. DOI: 10.1145/3545008.3545046 (https://doi.org/10.1145/3545008.3545046)
Citations: 3

Abstract

As high-performance computing infrastructures move towards exascale, the role of resource and job management systems is more critical now than ever. Simulating batch systems to improve scheduling algorithms and resource management efficiency is an indispensable option, as running large-scale experiments is expensive and time-consuming. Batch-system simulators are responsible for simulating the computing infrastructure and the types of jobs that constitute the workload. In contrast to rigid jobs, malleable jobs can dynamically reconfigure their resources during runtime. Although studies indicate that malleability can improve system performance, no simulator exists to investigate malleable scheduling policies. In this work, we present ElastiSim, a batch-system simulator supporting the combined scheduling of rigid and malleable jobs. To facilitate the simulation, we propose a malleable workload model and introduce a scheduling protocol that enables the evaluation of topology-, I/O-, and progress-aware scheduling algorithms. We validate the scaling behavior of our workload model by comparing training runtimes of various deep-learning models against the results achieved by ElastiSim. We use real-world cluster trace files to generate workloads and simulate various scheduling algorithms (FCFS, SJF, DRF, SRTF) to analyze their implications on the simulated platform. The results demonstrate that real-world executions show the same scaling behavior as our proposed workload model. We further show that ElastiSim can capture the complex interplay between emerging workloads and modern platforms to support algorithm designers by providing consistently meaningful results. ElastiSim is publicly available as an open-source project on https://github.com/elastisim.
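The difference between rigid and malleable jobs described above can be illustrated with a small toy simulation. This is a minimal sketch and not ElastiSim's API or workload model: the `Job` class, the `simulate_fcfs` function, the node-second work unit, and the linear-speedup assumption are all hypothetical and chosen only for illustration. Jobs start in FCFS order at their minimum size; when malleability is enabled, running jobs expand into idle nodes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    work: float      # total work in node-seconds (hypothetical unit)
    min_nodes: int   # nodes needed to start
    max_nodes: int   # upper bound when the job is allowed to expand


def simulate_fcfs(jobs, total_nodes, malleable, dt=1.0):
    """Time-stepped toy model; returns {job name: completion time}.

    Assumes perfect linear speedup, which real applications rarely achieve.
    """
    if any(j.min_nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the platform has")
    remaining = {j.name: j.work for j in jobs}
    queue = list(jobs)   # FCFS: submission order
    running = {}         # job name -> (job, currently assigned nodes)
    finished = {}
    t = 0.0
    while queue or running:
        free = total_nodes - sum(n for _, n in running.values())
        # Start queued jobs strictly in order, each at its minimum size.
        while queue and queue[0].min_nodes <= free:
            job = queue.pop(0)
            running[job.name] = (job, job.min_nodes)
            free -= job.min_nodes
        # Malleable case: let running jobs grow into idle nodes.
        if malleable:
            for name, (job, n) in list(running.items()):
                grow = min(free, job.max_nodes - n)
                running[name] = (job, n + grow)
                free -= grow
        # Advance simulated time and account for the work done.
        t += dt
        for name, (job, n) in list(running.items()):
            remaining[name] -= n * dt
            if remaining[name] <= 1e-9:
                finished[name] = t
                del running[name]
    return finished


if __name__ == "__main__":
    jobs = [Job("A", 64, 4, 8), Job("B", 16, 4, 8)]
    print("rigid:    ", simulate_fcfs(jobs, total_nodes=8, malleable=False))
    print("malleable:", simulate_fcfs(jobs, total_nodes=8, malleable=True))
```

In this example both jobs start on 4 of 8 nodes and B finishes after 4 seconds. In the rigid run, A stays on 4 nodes and finishes at 16 seconds; in the malleable run, A expands to all 8 nodes once B completes and finishes at 10 seconds. The sketch ignores topology, I/O, and reconfiguration cost, all of which the abstract names as aspects a realistic simulator has to consider.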