Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver

Karla Morris, F. Rizzi, Brendan Cook, Paul Mycek, O. Maître, O. Knio, K. Sargsyan, K. Dahlgren, B. Debusschere
Published in: 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
Publication date: 2016-11-13
DOI: 10.1109/SCALA.2016.10 (https://doi.org/10.1109/SCALA.2016.10)
Citations: 3

Abstract

We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model in which all state information is held by the servers, while clients serve only as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to client failures. We present weak and strong scaling results on Edison, at the National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested on up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.
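To illustrate why a robust regression step can tolerate a silently corrupted sample, the following minimal sketch compares an ordinary least-squares line fit with a median-based (Theil-Sen) fit on ten samples, one of which has been overwritten. The abstract does not say which robust regression the authors use, so Theil-Sen is a stand-in chosen only for illustration; the linear map, the sample values, and the function names are all hypothetical.

```python
from itertools import combinations
from statistics import median


def least_squares_fit(xs, ys):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def theil_sen_fit(xs, ys):
    """Median-based (Theil-Sen) line fit; returns (slope, intercept).

    Assumption: this is an illustrative robust estimator, not the
    regression method used in the paper.
    """
    # Slope of the line through every pair of samples.
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i, j in combinations(range(len(xs)), 2)
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Hypothetical samples of the linear map y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[4] = 1.0e6  # simulate a silent data corruption in one sample

ls_slope, ls_intercept = least_squares_fit(xs, ys)
ts_slope, ts_intercept = theil_sen_fit(xs, ys)

print("least squares:", ls_slope, ls_intercept)
print("Theil-Sen:    ", ts_slope, ts_intercept)
```

In this toy setting the single corrupted value pulls the least-squares fit far from the true line, while the median-based fit is unaffected because most pairwise slopes do not involve the corrupted sample.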