Design Trade-Offs and Deadlock Prevention in Transient Fault-Tolerant SMT Processors

2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06). Pub Date: 2006-12-18. DOI: 10.1109/PRDC.2006.25

Xiaobin Li, J. Gaudiot



Abstract

Since the very concept of simultaneous multi-threading (SMT) entails inherent redundancy, several proposals have been made to run two copies of the same thread on an SMT platform in order to detect and correct soft errors. Upon detection of an error, the processor state is rolled back to a known safe point and the instructions are retried, resulting in completely error-free execution. This paper focuses on two crucial implementation issues introduced by this concept: (i) the design trade-off between fault detection coverage and design cost; (ii) the possible occurrence of deadlock. To achieve the largest possible fault detection coverage, we replicate the fetched instructions to generate the redundant thread copies. Further, we apply SMT thread scheduling at the instruction dispatch stage to lower the performance overhead. Compared to the baseline processor, our simulation results show that with these two new schemes the performance overhead can be reduced to as little as 34% on average, down from 42%. Finally, in the fault-tolerant execution mode, since the two thread copies cooperate with one another, deadlock situations could be quite common. We thus present a detailed deadlock analysis and conclude that allocating some entries of the reorder buffer (ROB), load queue (LQ), and store queue (SQ) for the trailing thread is sufficient to prevent such deadlocks.
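The reservation idea in the last sentence of the abstract can be illustrated with a small model. The sketch below is not the authors' implementation: the class `SharedQueue`, its fields, and the admission rule (the leading thread may never occupy more than `capacity - reserved_for_trailing` entries) are assumptions chosen for illustration. It only shows why setting entries aside lets the trailing thread always make progress in a shared structure such as the ROB, LQ, or SQ.

```python
from dataclasses import dataclass

# Hypothetical thread labels for the two redundant copies.
LEADING, TRAILING = "leading", "trailing"


@dataclass
class SharedQueue:
    """Toy model of a shared structure (e.g. ROB, LQ, or SQ).

    Assumption: `reserved_for_trailing` entries can never be taken by the
    leading thread, so the trailing thread always has room to allocate.
    """
    capacity: int
    reserved_for_trailing: int
    leading_used: int = 0
    trailing_used: int = 0

    def free_entries(self) -> int:
        return self.capacity - self.leading_used - self.trailing_used

    def can_allocate(self, thread: str) -> bool:
        if self.free_entries() <= 0:
            return False
        if thread == LEADING:
            # The leading thread is capped below the full capacity.
            return self.leading_used < self.capacity - self.reserved_for_trailing
        return True

    def allocate(self, thread: str) -> bool:
        """Take one entry for `thread`; return False if it must stall."""
        if not self.can_allocate(thread):
            return False
        if thread == LEADING:
            self.leading_used += 1
        else:
            self.trailing_used += 1
        return True

    def release(self, thread: str) -> None:
        """Free one entry held by `thread` (e.g. at commit)."""
        if thread == LEADING and self.leading_used > 0:
            self.leading_used -= 1
        elif thread == TRAILING and self.trailing_used > 0:
            self.trailing_used -= 1


if __name__ == "__main__":
    rob = SharedQueue(capacity=4, reserved_for_trailing=1)
    # The leading thread runs ahead and tries to fill the whole structure.
    print([rob.allocate(LEADING) for _ in range(4)])
    # The trailing thread can still obtain an entry.
    print(rob.allocate(TRAILING))
```

With `reserved_for_trailing=0` in this model, the leading thread can fill every entry while it waits on the trailing thread, and the trailing thread then has nowhere to allocate, which is the kind of circular wait the reservation is meant to rule out.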


