ParaStack: Efficient Hang Detection for MPI Programs at Large Scale

SC17: International Conference for High Performance Computing, Networking, Storage and Analysis
Pub Date: 2017-11-12
DOI: 10.1145/3126908.3126938

Hongbo Li, Zizhong Chen, Rajiv Gupta


Citation count: 11

Abstract

While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to choose the timeout: too small a value leads to a high false alarm rate, and too large a value wastes a vast amount of valuable computing resources. To address these problems, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner with high accuracy, negligible overhead, and great scalability, without requiring the user to select a timeout value. For a detected hang, it provides direction for further analysis by telling users whether the hang results from an error in the computation phase or in the communication phase. For a computation-error induced hang, the tool pinpoints the faulty process by excluding hundreds or thousands of other processes. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, which are, respectively, the world's current 2nd and 12th fastest supercomputers. Experimental results demonstrate that ParaStack detects hangs in a timely manner at negligible overhead with over 99% accuracy. No false alarm is observed in correct runs taking 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack accurately reports the faulty process for computation-error induced hangs.
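To make the diagnosis step concrete, here is a minimal Python sketch of classifying a hang as computation-phase or communication-phase and narrowing it down to a faulty process by exclusion. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the mechanism, so the function name `classify_hang`, the per-rank boolean samples (True meaning the rank was observed inside a communication call), and the decision rule are all hypothetical. The sketch also assumes a hang has already been detected by some other means.

```python
from typing import Dict, List, Tuple


def classify_hang(samples: Dict[int, List[bool]]) -> Tuple[str, List[int]]:
    """Classify an already-detected hang from per-rank observations.

    `samples` maps a process rank to a list of observations taken after the
    hang was flagged; True means the rank was seen inside a communication
    call, False means it was seen in computation code. This input format is
    an assumption made for illustration.
    """
    if not samples:
        raise ValueError("no samples provided")

    # Exclude every rank that was seen waiting in communication at least
    # once; what remains are ranks that never left computation code.
    stuck_in_compute = sorted(
        rank for rank, obs in samples.items() if obs and not any(obs)
    )

    if stuck_in_compute:
        # Some rank never reached communication: treat the hang as caused
        # by a computation-phase error and report those ranks as suspects.
        return "computation", stuck_in_compute

    # Every rank was seen waiting in communication: treat the hang as a
    # communication-phase error; no single rank is singled out here.
    return "communication", []


if __name__ == "__main__":
    observed = {0: [True, True], 1: [False, False], 2: [True, True]}
    print(classify_hang(observed))
```

In this toy input, rank 1 is never seen in a communication call while ranks 0 and 2 always are, so the sketch reports a computation-phase hang with rank 1 as the suspect. A real tool would need a sampling mechanism and a statistical test to decide that a hang has occurred in the first place, which this sketch deliberately leaves out.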
