A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique

Md. Mohsin Ali, P. Strazdins, B. Harding, M. Hegland, J. Larson
Published in: 2015 International Conference on High Performance Computing & Simulation (HPCS), 2015-07-20
DOI: https://doi.org/10.1109/HPCSim.2015.7237082
Citations: 17

Abstract

Applications that perform ultra-large-scale simulations by solving PDEs require very large computational systems to reach a solution in a timely manner. Studies have shown that the failure rate grows with system size, and this trend is likely to worsen in future machines as less reliable components are used to reduce energy cost. Thus, as systems and the problems solved on them continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It can also be easily modified to provide algorithm-based fault tolerance for these problems. In this paper, we show how the SGCT can produce a fault-tolerant version of the GENE gyrokinetic plasma application, which evolves a 5D complex density field over time. We use an alternate component grid combination formula to recover data from lost processes. User Level Failure Mitigation (ULFM) MPI is used to recover the processes, and our implementation is robust across multiple failures and recoveries, for both process and node failures. An acceptable degree of modification of the application is required. Results using the SGCT on two of the field's dimensions show competitive execution times with acceptable error (within 0.1%), compared to the same simulation with a single full-resolution grid. The benefits improve when the SGCT is used over three dimensions. Our experiments show that the GENE application can successfully recover from multiple process failures, and that applying the SGCT the corresponding number of times minimizes the error for the lost sub-grids. Application recovery overhead via ULFM MPI increases from ~1.5 s at 64 cores to ~5 s at 2048 cores for a one-off failure. This compares favourably to using GENE's in-built checkpointing with job restart in conjunction with the classical SGCT on failure, which has overheads four times as large for a single failure, excluding the backtrack overhead. An analysis for a long-running application that takes checkpoint backtrack times into account indicates a reduction in overhead of over an order of magnitude.
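As a rough illustration of the classical SGCT mentioned in the abstract, the sketch below lists the component grids and their combination coefficients. It is not taken from the paper: the function name `classical_sgct_coefficients`, the convention that level indices start at 1, and the choice of diagonals are assumptions made for illustration. It shows only the classical formula, not the alternate fault-tolerant combination formula the authors use to recover lost sub-grids.

```python
from itertools import product
from math import comb


def classical_sgct_coefficients(level: int, dim: int = 2) -> dict:
    """Map each component-grid level vector to its combination coefficient.

    Assumed convention (hypothetical, not from the paper): every entry of a
    level vector is >= 1. Diagonal q (0 <= q < dim) contains the vectors whose
    entries sum to level + dim - 1 - q, each with coefficient
    (-1)^q * C(dim - 1, q).
    """
    coeffs = {}
    for q in range(dim):
        target = level + dim - 1 - q
        coefficient = (-1) ** q * comb(dim - 1, q)
        # Enumerate all candidate level vectors and keep those on diagonal q.
        for vec in product(range(1, level + 1), repeat=dim):
            if sum(vec) == target:
                coeffs[vec] = coefficient
    return coeffs


if __name__ == "__main__":
    for vec, c in sorted(classical_sgct_coefficients(3, 2).items()):
        print(vec, c)
```

In two dimensions with `level=3`, this gives three grids on the upper diagonal with coefficient +1 and two on the lower diagonal with coefficient -1. The coefficients sum to 1, which is the property any replacement formula must also preserve when a component grid is lost.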