了解基于预测的部分冗余线程，以实现低开销、高覆盖率的容错性

ASPLOS XII Pub Date : 2006-10-23 DOI:10.1145/1168857.1168869

Vimal K. Reddy, E. Rotenberg, Sailashri Parthasarathy

{"title":"了解基于预测的部分冗余线程，以实现低开销、高覆盖率的容错性","authors":"Vimal K. Reddy, E. Rotenberg, Sailashri Parthasarathy","doi":"10.1145/1168857.1168869","DOIUrl":null,"url":null,"abstract":"Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next we show that the current recovery implementation fails to leverage excellent detection capability, since recovery sometimes initiates belatedly, after already retiring a detected faulty instruction. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution whereas full duplication slows performance by 14%.A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"46","resultStr":"{\"title\":\"Understanding prediction-based partial redundant threading for low-overhead, high- coverage fault tolerance\",\"authors\":\"Vimal K. Reddy, E. Rotenberg, Sailashri Parthasarathy\",\"doi\":\"10.1145/1168857.1168869\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next we show that the current recovery implementation fails to leverage excellent detection capability, since recovery sometimes initiates belatedly, after already retiring a detected faulty instruction. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution whereas full duplication slows performance by 14%.A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before.\",\"PeriodicalId\":270694,\"journal\":{\"name\":\"ASPLOS XII\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-10-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"46\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ASPLOS XII\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1168857.1168869\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ASPLOS XII","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1168857.1168869","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 46

摘要

冗余线程体系结构复制所有指令以检测并可能从瞬态故障中恢复。最近提出了几种轻量级的部分冗余线程(PRT)架构。(i)机会容错只在单线程性能差的时期重复指令。(ii) ReStore不明确地重复指令，而是利用高度自信的分支预测中的错误预测作为故障的症状。(iii)滑流通过用高度自信的预测取代许多指令来减少交替线程。我们探索了PRT作为实现单线程执行性能的完全复制容错的可能方向。机会主义和恢复产生部分覆盖，因为它们分别限于仅使用部分复制或仅使用可靠的预测。先前对滑流容错的分析是粗略的，得出的结论是只包括重复的指令。在本文中，我们试图更好地理解滑流的容错，推测部分复制和自信预测的混合实际上非常接近完全复制的覆盖范围。对预测场景的彻底剖析证实，几乎100%的指令中的错误都是可以检测到的。由于巧合错误和错误预测，不到0.1%的错误指令无法检测到。接下来，我们将展示当前的恢复实现无法利用出色的检测功能，因为恢复有时在已经淘汰检测到的错误指令之后才开始启动。我们提出并评估了一套简单的微架构更改来恢复和检查。使用最好的修改，Slipstream可以从99%的指令错误中恢复，相比之下，没有修改的指令只有78%。这两个结果都比过去的研究预测的要高得多，过去的研究声称只有重复指令的覆盖率，即65%的指令。在8个问题的SMT处理器上，Slipstream的单线程执行速度只有1.3%，而完全复制会使性能降低14%。本文的一个重要副产品是一个新的分析框架，其中每个动态指令都被认为是假设错误的，因此不需要显式的错误注入。故障覆盖率是根据之前直接或间接检测到的候选错误指令的比例来测量的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Understanding prediction-based partial redundant threading for low-overhead, high- coverage fault tolerance

Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next we show that the current recovery implementation fails to leverage excellent detection capability, since recovery sometimes initiates belatedly, after already retiring a detected faulty instruction. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution whereas full duplication slows performance by 14%.A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ASPLOS XII

自引率

0.00%

发文量