Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations

Vahid Jafari, Philipp Neumann
{"title":"Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations","authors":"Vahid Jafari, Philipp Neumann","doi":"10.1145/3578178.3578220","DOIUrl":null,"url":null,"abstract":"Molecular dynamics (MD) simulations exhibit big computational efforts, which makes them very time-consuming. This particularly holds for molecular-continuum simulations in fluid dynamics, which rely on the simulation of MD ensembles that are coupled to computational fluid dynamics (CFD) solvers. Massively parallel implementations for MD simulations and the respective ensembles are therefore of utmost importance. However, the more processors are used for the molecular-continuum simulation, the higher the probability of software- and hardware-induced failures or malfunctions of one processor becomes, which may lead to the issue that the entire simulation crashes. To avoid long re-calculation times for the simulation, a fault tolerance mechanism is required, especially considering respective simulations carried out at the exascale. In this paper, we introduce a fault tolerance method for molecular-continuum simulations implemented in the macro-micro-coupling tool (MaMiCo), an open-source coupling tool for such multiscale simulations which allows the re-use of one’s favorite MD and CFD solvers. The method makes use of a dynamic ensemble handling approach that has been used previously to estimate statistical errors due to thermal fluctuations in the MD ensemble. The dynamic ensemble is always homogeneously distributed and, thus, balanced on the computational resources to minimize the overall induced overhead overhead. The method further relies on an MPI implementation with fault tolerance support. We report scalability results with and without modeled system failures on three TOP500 supercomputers—Fugaku/RIKEN with ARM technology, Hawk/HLRS with AMD EPYC technology and HSUper/Helmut Schmidt University with Intel Icelake processors—to demonstrate the feasibility of our approach.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"45 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3578178.3578220","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Molecular dynamics (MD) simulations require large computational effort, which makes them very time-consuming. This particularly holds for molecular-continuum simulations in fluid dynamics, which rely on the simulation of MD ensembles that are coupled to computational fluid dynamics (CFD) solvers. Massively parallel implementations of MD simulations and the respective ensembles are therefore of utmost importance. However, the more processors a molecular-continuum simulation uses, the higher the probability becomes that a software- or hardware-induced failure or malfunction of a single processor crashes the entire simulation. To avoid long re-calculation times, a fault tolerance mechanism is required, especially for simulations carried out at exascale. In this paper, we introduce a fault tolerance method for molecular-continuum simulations implemented in the macro-micro-coupling tool (MaMiCo), an open-source coupling tool for such multiscale simulations that allows the re-use of one's favorite MD and CFD solvers. The method makes use of a dynamic ensemble handling approach that has previously been used to estimate statistical errors due to thermal fluctuations in the MD ensemble. The dynamic ensemble is always homogeneously distributed and, thus, balanced across the computational resources to minimize the overall induced overhead. The method further relies on an MPI implementation with fault tolerance support. We report scalability results with and without modeled system failures on three TOP500 supercomputers, Fugaku/RIKEN with ARM technology, Hawk/HLRS with AMD EPYC technology, and HSUper/Helmut Schmidt University with Intel Ice Lake processors, to demonstrate the feasibility of our approach.
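
The abstract does not reproduce source code, but the recovery scheme it describes (surviving ranks detect a failure, rebuild the communicator, and rebalance the MD ensemble) maps naturally onto the ULFM (User-Level Failure Mitigation) MPI extensions, one common realization of "an MPI implementation with fault tolerance support". The following C++ sketch is a minimal, hypothetical illustration of such a loop, assuming a ULFM-capable MPI library (e.g., Open MPI built with ULFM support); run_coupling_cycle and the redistribute_ensemble placeholder are invented names, and this is not MaMiCo's actual implementation.

// Hypothetical ULFM-style recovery loop; not MaMiCo's actual code.
// Assumes an MPI library with the ULFM fault-tolerance extensions,
// which provide MPIX_ERR_PROC_FAILED, MPIX_Comm_revoke, MPIX_Comm_shrink.
#include <mpi.h>
#include <mpi-ext.h> // ULFM prototypes (MPIX_*); header name as in Open MPI
#include <cstdio>

static int run_coupling_cycle(MPI_Comm comm) {
  // ... advance all MD instances owned by this rank, then exchange
  // averaged quantities with the CFD solver (omitted) ...
  int token = 0;
  // One collective per cycle; with MPI_ERRORS_RETURN it reports process
  // failures through its return code instead of aborting the job.
  return MPI_Allreduce(MPI_IN_PLACE, &token, 1, MPI_INT, MPI_SUM, comm);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MPI_Comm world;
  MPI_Comm_dup(MPI_COMM_WORLD, &world);
  // Hand errors back to the application instead of terminating it.
  MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

  for (int cycle = 0; cycle < 100; ++cycle) {
    int rc = run_coupling_cycle(world);
    if (rc != MPI_SUCCESS) {
      int ec;
      MPI_Error_class(rc, &ec);
      if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
        // Propagate the failure so every survivor leaves the collective,
        // then build a smaller communicator of surviving ranks only.
        MPIX_Comm_revoke(world);
        MPI_Comm survivors;
        MPIX_Comm_shrink(world, &survivors);
        MPI_Comm_free(&world);
        world = survivors;
        MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);
        // redistribute_ensemble(world); // hypothetical: respawn lost MD
        // instances and rebalance the ensemble homogeneously over ranks.
        std::puts("recovered from process failure, continuing");
      }
    }
  }
  MPI_Comm_free(&world);
  MPI_Finalize();
  return 0;
}

The essential design point, mirrored from the abstract, is that a process failure shrinks the communicator rather than aborting the job, after which the remaining ranks redistribute the ensemble homogeneously so that the load stays balanced and the induced overhead remains small.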