Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Impact factor: 0.9; JCR quartile: Q3; category: Computer Science, Theory & Methods
A. Benoit, Aurélien Cavelan, Y. Robert, Hongyang Sun
Journal: ACM Transactions on Parallel Computing, volume 85 1 (as listed), pages 13:1-13:36
DOI: 10.1145/2897189 (https://doi.org/10.1145/2897189)
Publication date: 2014-11-16
Publication type: Journal Article
Citations: 34

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
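
To make the first-order reasoning concrete, here is a minimal Python sketch. It is not the article's own derivation. The first function is the classical Young/Daly period for fail-stop errors, sqrt(2C / lambda_f). The other two rest on a simplified model that is an assumption of this sketch: a fail-stop error loses on average half of the current period, a silent error is only caught by the verification at the end of the period and so loses the whole period, and recovery and downtime costs are ignored. All function and parameter names are hypothetical; times are in seconds and error rates in errors per second.

```python
import math


def young_daly_period(checkpoint_cost, fail_stop_rate):
    """Classical first-order period for fail-stop errors only: sqrt(2C / lambda_f)."""
    return math.sqrt(2.0 * checkpoint_cost / fail_stop_rate)


def first_order_overhead(work, checkpoint_cost, verification_cost,
                         fail_stop_rate, silent_rate):
    """Approximate overhead per unit of work for a period of length `work`.

    Assumed model (not taken from the article): each period pays one
    verification and one checkpoint; a fail-stop error wastes work/2 on
    average, a silent error wastes the full period.
    """
    resilience_cost = (checkpoint_cost + verification_cost) / work
    expected_rework = (fail_stop_rate / 2.0 + silent_rate) * work
    return resilience_cost + expected_rework


def period_with_verification(checkpoint_cost, verification_cost,
                             fail_stop_rate, silent_rate):
    """Minimizer of first_order_overhead over `work` under the assumed model."""
    return math.sqrt((checkpoint_cost + verification_cost)
                     / (fail_stop_rate / 2.0 + silent_rate))


if __name__ == "__main__":
    C, V = 60.0, 10.0          # illustrative costs in seconds
    lam_f, lam_s = 1e-5, 2e-5  # illustrative error rates per second
    w_star = period_with_verification(C, V, lam_f, lam_s)
    print("fail-stop only period:", young_daly_period(C, lam_f))
    print("period with verification and silent errors:", w_star)
    print("overhead at that period:",
          first_order_overhead(w_star, C, V, lam_f, lam_s))
```

With no verification cost and no silent errors, the second formula reduces to the Young/Daly period, which is a useful sanity check on the assumed model. The numeric values in the usage example are illustrative and are not taken from the article's simulations.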
Source journal: ACM Transactions on Parallel Computing (Computer Science, Theory & Methods). CiteScore: 4.10. Self-citation rate: 0.00%. Articles published: 16.