Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation

L. Bautista-Gomez, F. Cappello
Venue: 2015 IEEE International Conference on Cluster Computing
Published: 2015-09-08
DOI: 10.1109/CLUSTER.2015.108 (https://doi.org/10.1109/CLUSTER.2015.108)
Citations: 43

Abstract

High-performance computing is a powerful tool that allows scientists to study complex natural phenomena. Extreme-scale supercomputers promise performance orders of magnitude higher than that of current systems. However, power constraints in future exascale systems might limit the level of resilience of those machines. In particular, data could get corrupted silently, that is, without the hardware detecting the corruption. This situation is clearly unacceptable: simulation results must be within the error margin specified by the user. In this paper, we exploit multivariate interpolation to detect and correct data corruption in stencil applications. We evaluate this technique with a turbulent fluid application, and we demonstrate that the prediction error using multivariate interpolation is on the order of 0.01. Our results show that this mechanism can detect and correct most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute. In addition, we stress test the detector by injecting more than ten corruptions per minute and observe that our strategy allows the application to produce results with an error deviation under 10% in such a stressful scenario.
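
The abstract does not give the interpolation scheme, the detection threshold, or the data layout, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a 2D grid stored as a list of lists, predicts each interior point as the plain average of its four axis-aligned neighbors, and treats a point as corrupted when it deviates from that prediction by more than a user-chosen threshold. The names predict, detect_and_correct, threshold, and max_fixes are hypothetical. Because a corrupted value also skews the predictions of its neighbors, the sketch repairs only the single worst offender per pass and then rescans.

```python
def predict(grid, i, j):
    """Estimate grid[i][j] as the average of its four axis-aligned neighbors.

    Assumption: a plain neighbor average stands in for the multivariate
    interpolation named in the abstract; the paper's scheme may differ.
    """
    return (grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]) / 4.0


def detect_and_correct(grid, threshold, max_fixes=10):
    """Repair, in place, interior points that deviate from their prediction.

    Each pass finds the interior point with the largest deviation above
    `threshold`, overwrites it with its predicted value, and rescans.
    Boundary points are not checked in this sketch.
    Returns the list of (row, col) positions that were corrected.
    """
    rows, cols = len(grid), len(grid[0])
    fixed = []
    for _ in range(max_fixes):
        worst, worst_err = None, threshold
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                err = abs(grid[i][j] - predict(grid, i, j))
                if err > worst_err:
                    worst, worst_err = (i, j), err
        if worst is None:
            break  # nothing exceeds the threshold any more
        i, j = worst
        grid[i][j] = predict(grid, i, j)
        fixed.append(worst)
    return fixed


def make_smooth_grid(n):
    """Build an n x n test field that varies linearly in both directions."""
    return [[0.1 * i + 0.2 * j for j in range(n)] for i in range(n)]
```

A usage example under the same assumptions: build a smooth field with make_smooth_grid(6), add a large offset to one interior value to mimic a silent bit flip, and call detect_and_correct(grid, threshold=0.5). On a linear field the neighbor average is exact, so the corrupted point is restored to its original value. On a real turbulent field the prediction is only approximate, which is why the threshold has to be tuned against the expected prediction error.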