FaultHound: Value-locality-based soft-fault tolerance

Nitin, I. Pomeranz, T. N. Vijaykumar
Published in: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 668-681, 2015-06-13
DOI: https://doi.org/10.1145/2749469.2750372
Citations: 8

Abstract

Soft error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates, and performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce further the false-positive rate; (3) a light-weight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%).
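
The abstract does not describe how a bit-mask filter is built, so the following is only a minimal sketch of the general idea of screening values against a learned neighborhood. It assumes a single filter that remembers the last value seen and a mask of bit positions that have never changed; the class name `BitMaskFilter`, its methods, and the 32-bit width are hypothetical and are not taken from PBFS or FaultHound.

```python
class BitMaskFilter:
    """Toy value-neighborhood filter (illustrative only, not the paper's design)."""

    def __init__(self, width=32):
        self.full = (1 << width) - 1
        self.ref = None           # last value observed (assumed reference point)
        self.stable = self.full   # 1 = this bit position has never been seen to change

    def check(self, value):
        """Return the stable bits that differ from the reference.

        A result of 0 means the value lies inside the learned neighborhood;
        a nonzero result is a hint of a possible soft fault (or a false positive).
        """
        if self.ref is None:
            return 0
        return (value ^ self.ref) & self.stable

    def learn(self, value):
        """Widen the neighborhood so that `value` is accepted, then remember it."""
        if self.ref is not None:
            changed = value ^ self.ref
            self.stable &= ~changed & self.full
        self.ref = value


# Train on a few nearby values (for example, addresses with a small stride).
f = BitMaskFilter()
for v in (0x1000, 0x1004, 0x1008):
    f.learn(v)

inside = f.check(0x100C)               # differs only in a bit already seen to change
outside = f.check(0x1008 ^ (1 << 20))  # a flip in a high-order bit that was always stable
```

In this sketch a flagged value only raises a hint. Whether that hint triggers a replay, a rollback, or further learning is exactly the policy question that the five mechanisms listed in the abstract address, and none of that policy is modeled here.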