Design Trade-Offs and Deadlock Prevention in Transient Fault-Tolerant SMT Processors

Xiaobin Li, J. Gaudiot
{"title":"Design Trade-Offs and Deadlock Prevention in Transient Fault-Tolerant SMT Processors","authors":"Xiaobin Li, J. Gaudiot","doi":"10.1109/PRDC.2006.25","DOIUrl":null,"url":null,"abstract":"Since the very concept of simultaneous multi-threading (SMT) entails inherent redundancy, some proposals have been made to run two copies of the same thread on top of SMT platforms in order to detect and correct soft errors. This allows, upon detection of an error, for the rolling back of the processor state to a known safe point, and then a retry of the instructions, thereby resulting in a completely error-free execution. This paper focuses on two crucial implementation issues introduced by this concept: (i) the design trade-off between the fault detection coverage versus the design costs; (ii) the possible occurrence of deadlock situations. To achieve the largest possible fault detection coverage, we replicate the instructions fetched in order to generate the redundant thread copies. Further, we apply the SMT thread scheduling at the instruction dispatch stage so as to lower the performance overhead. As a result, when compared to the baseline processor, our simulation results show that by using our two new schemes, the performance overhead can be reduced down to as little as 34% on the average, down from 42%. Finally, in the fault-tolerant execution mode, since the two copied threads are cooperating with one another, deadlock situations could be quite common. We thus present a detailed deadlock analysis and then conclude that allocating some entries of ROB, LQ, and SQ for the trailing thread is sufficient to prevent such deadlocks","PeriodicalId":314915,"journal":{"name":"2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PRDC.2006.25","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Since the very concept of simultaneous multi-threading (SMT) entails inherent redundancy, some proposals have been made to run two copies of the same thread on top of SMT platforms in order to detect and correct soft errors. This allows, upon detection of an error, for the rolling back of the processor state to a known safe point, and then a retry of the instructions, thereby resulting in a completely error-free execution. This paper focuses on two crucial implementation issues introduced by this concept: (i) the design trade-off between the fault detection coverage versus the design costs; (ii) the possible occurrence of deadlock situations. To achieve the largest possible fault detection coverage, we replicate the instructions fetched in order to generate the redundant thread copies. Further, we apply the SMT thread scheduling at the instruction dispatch stage so as to lower the performance overhead. As a result, when compared to the baseline processor, our simulation results show that by using our two new schemes, the performance overhead can be reduced down to as little as 34% on the average, down from 42%. Finally, in the fault-tolerant execution mode, since the two copied threads are cooperating with one another, deadlock situations could be quite common. We thus present a detailed deadlock analysis and then conclude that allocating some entries of ROB, LQ, and SQ for the trailing thread is sufficient to prevent such deadlocks
瞬态容错SMT处理器的设计权衡与死锁预防
由于同步多线程(SMT)的概念本身就需要固有的冗余,因此有人建议在SMT平台上运行同一线程的两个副本,以便检测和纠正软错误。这允许在检测到错误时将处理器状态回滚到已知的安全点,然后重试指令,从而导致完全无错误的执行。本文重点讨论了该概念引入的两个关键实现问题:(i)故障检测覆盖率与设计成本之间的设计权衡;(ii)可能发生的僵局情况。为了实现最大的故障检测覆盖率,我们复制获取的指令,以生成冗余线程副本。此外,我们在指令调度阶段应用SMT线程调度,以降低性能开销。因此,当与基准处理器进行比较时,我们的模拟结果表明,通过使用我们的两种新方案,性能开销平均可以降低到34%,而不是42%。最后,在容错执行模式下,由于两个复制的线程相互协作,死锁情况可能非常常见。因此,我们提供了一个详细的死锁分析,然后得出结论,为跟踪线程分配一些ROB、LQ和SQ条目足以防止此类死锁
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信