Assessing the Impact of Partial Verifications against Silent Data Corruptions

2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.53

Aurélien Cavelan, S. Raina, Y. Robert, Hongyang Sun


Citations: 8

Abstract

Silent errors, or silent data corruptions, pose a major threat on very large-scale platforms. When a silent error strikes, it is not detected immediately but only after some delay, which rules out the pure periodic checkpointing approaches devised for fail-stop errors. Instead, checkpointing must be coupled with a verification mechanism that guarantees corrupted data will never be written into the checkpoint file. Such a guaranteed verification mechanism typically incurs a high cost. In this paper, we assess the impact of using partial verification mechanisms in addition to a guaranteed verification. The main objective is to investigate to what extent it is worthwhile to use low-cost but less accurate verifications in the middle of a periodic computing pattern, which ends with a guaranteed verification right before each checkpoint. Introducing partial verifications dramatically complicates the analysis, but we are able to analytically determine the optimal computing pattern (up to a first-order approximation), including the optimal length of the pattern, the optimal number of partial verifications, and their optimal positions inside the pattern. Performance evaluations based on a wide range of parameters confirm the benefit of using partial verifications under certain scenarios, compared to the baseline algorithm that uses only guaranteed verifications.
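The pattern described in the abstract can be explored with a small Monte Carlo sketch. This is an illustration only, not the paper's analytical model: every name and parameter value below is hypothetical, partial verifications are assumed to be equally spaced (the abstract does not state the optimal positions), silent errors are assumed to arrive at an exponential rate during computation only, and a detected error is assumed to trigger a recovery followed by a restart of the whole pattern.

```python
import math
import random


def simulate_pattern(work, segments, lam, v_partial, recall,
                     v_guaranteed, ckpt, recovery, rng):
    """Return the time needed to complete one pattern successfully.

    Hypothetical model: `work` is split into `segments` equal chunks.
    Each chunk but the last is followed by a partial verification that
    detects an existing corruption with probability `recall`; the last
    chunk is followed by a guaranteed verification, then a checkpoint.
    """
    seg = work / segments
    elapsed = 0.0
    while True:
        corrupted = False  # state is clean after a checkpoint or recovery
        detected = False
        for i in range(segments):
            elapsed += seg
            # Probability that at least one silent error strikes the chunk.
            if rng.random() < 1.0 - math.exp(-lam * seg):
                corrupted = True
            if i == segments - 1:
                elapsed += v_guaranteed
                detected = corrupted  # guaranteed verification never misses
            else:
                elapsed += v_partial
                detected = corrupted and rng.random() < recall
            if detected:
                break
        if not detected:
            return elapsed + ckpt  # verified clean, safe to checkpoint
        elapsed += recovery  # roll back and re-execute the pattern


def mean_pattern_time(trials, seed, **params):
    """Average simulate_pattern over `trials` runs with a fixed seed."""
    rng = random.Random(seed)
    total = sum(simulate_pattern(rng=rng, **params) for _ in range(trials))
    return total / trials
```

Calling `mean_pattern_time` once with `segments=1` (guaranteed verification only) and once with a larger `segments` value, while holding the other assumed parameters fixed, gives a rough feel for when the extra partial verifications pay for themselves.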



