CanarIO: Sounding the Alarm on IO-Related Performance Degradation

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Pub Date: 2020-05-01
DOI: 10.1109/IPDPS47924.2020.00018

Michael R. Wyatt, Stephen Herbein, Kathleen Shoga, T. Gamblin, M. Taufer

Pages: 73-83

Citations: 6

Abstract

Users interact with High Performance Computing (HPC) machines through batch systems, which take user job submissions and allocate them to computing resources. While some resource managers have a generalized resource model, in nearly all modern systems, nodes are the only resource managed. Other resources, such as parallel file systems, are also necessary for jobs to make progress, but schedulers are blind to these resources. Facility staff can manually detect critical problems and manually hold jobs that need particular file systems, but this requires manual monitoring. Without human intervention, modern schedulers will happily run jobs whose required resources are not available. As a result, resources are wasted when IO-intensive jobs are scheduled on file systems with degraded performance.

We introduce CanarIO, a tool for predicting the IO-sensitivity of HPC jobs and detecting IO-related performance degradation on HPC systems. CanarIO uses a set of "canary" IO probes run at regular intervals on the system. Using performance measurements from these jobs, CanarIO builds classifiers that can determine which jobs are IO-sensitive and when file system performance is degraded. We demonstrate the accuracy of our tool with a simulation of system execution using real HPC data. Specifically, we detect 37.5% of IO degradation events and correctly identify >90% of IO-sensitive jobs. We show that with CanarIO predictions we recover >1,500 node-hours in 10 days, with a potential maximum of nearly 10,000 node-hours. CanarIO is the first step necessary for augmenting schedulers to be resource-aware.
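The idea of a "canary" IO probe can be illustrated with a minimal sketch. This is not the CanarIO implementation: the abstract does not describe the probes or the classifiers in detail, so the probe below (a timed write, fsync, and read of a small file) and the detector (a fixed multiple of the median baseline time, standing in for a trained classifier) are assumptions made purely for illustration. The names `run_probe`, `is_degraded`, `size_bytes`, and `factor` are hypothetical.

```python
import os
import statistics
import tempfile
import time


def run_probe(directory, size_bytes=1 << 20):
    """Time one write-then-read of size_bytes in directory; return seconds.

    Hypothetical probe: the real CanarIO probes are not specified here.
    """
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to the file system
        with open(path, "rb") as f:
            f.read()
    finally:
        os.remove(path)  # always clean up the probe file
    return time.perf_counter() - start


def is_degraded(baseline_times, latest_time, factor=3.0):
    """Flag the latest probe if it is slower than factor x the baseline median.

    Assumption: a simple threshold rule stands in for a trained classifier.
    """
    return latest_time > factor * statistics.median(baseline_times)
```

In use, a scheduler-side monitor might call `run_probe` on each parallel file system mount at a fixed interval, keep a window of recent timings as the baseline, and hold IO-sensitive jobs while `is_degraded` returns true. A fixed threshold is cruder than a learned classifier, but it shows where probe measurements would feed into the scheduling decision.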


