平滑粒子流体动力学模拟中无声数据损坏的检测

Aurélien Cavelan, R. Cabezón, F. Ciorba
{"title":"平滑粒子流体动力学模拟中无声数据损坏的检测","authors":"Aurélien Cavelan, R. Cabezón, F. Ciorba","doi":"10.1109/CCGRID.2019.00013","DOIUrl":null,"url":null,"abstract":"Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on a HPC system. The results show that SPR achieves detection rates of 91-99.9%, no false-positives, at an overhead of 1-10%.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations\",\"authors\":\"Aurélien Cavelan, R. Cabezón, F. Ciorba\",\"doi\":\"10.1109/CCGRID.2019.00013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on a HPC system. The results show that SPR achieves detection rates of 91-99.9%, no false-positives, at an overhead of 1-10%.\",\"PeriodicalId\":234571,\"journal\":{\"name\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGRID.2019.00013\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

摘要

静默数据损坏(sdc)阻碍了大规模计算系统上长期运行的科学应用程序的正确性。本文提出了选择性粒子复制(SPR)方法,这是光滑粒子流体力学(SPH)模拟中第一个基于粒子复制的SDCs检测方法。SPH是一种无网格拉格朗日方法,通常用于天体物理学和计算流体动力学中进行流体动力学模拟。SPH对相邻离散点(称为SPH粒子)的物理性质进行插值,动态地调整其分布以适应流体的质量密度场。当一个错误(例如,位翻转)打击计算或与粒子相关的数据时,由此产生的错误通过这样的插值步骤无声地传播到所有最近的邻居。SPR复制了一些精心挑选的SPH粒子的计算和数据。当一个粒子的数据由于损坏而与其复制的对应粒子不同时,就会检测到SDCs。通过确保所有粒子至少有一个被复制的邻居,SPR能够在传播时检测到许多DRAM sdc。通过一组错误注入和检测实验来评估SPR的检测能力,并通过在高性能计算系统上进行的一组强尺度实验来评估SPR的开销。结果表明,SPR的检出率为91 ~ 99.9%,无假阳性,开销为1 ~ 10%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on a HPC system. The results show that SPR achieves detection rates of 91-99.9%, no false-positives, at an overhead of 1-10%.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信