Toward efficient check-pointing and rollback under on-demand SBST in chip multi-processors

M. Skitsas, C. Nicopoulos, M. Michael
{"title":"Toward efficient check-pointing and rollback under on-demand SBST in chip multi-processors","authors":"M. Skitsas, C. Nicopoulos, M. Michael","doi":"10.1109/IOLTS.2015.7229842","DOIUrl":null,"url":null,"abstract":"In-field on-line testing techniques have recently been proposed for permanent fault detection caused by wear-out/aging-related defects manifesting during the lifetime of a system. Selective Software-Based Self-Testing (SBST) is one such paradigm focusing primarily on the recently stressed functional units of a multicore system at a sub-core granularity, in an attempt to reduce the application performance penalty caused by periodically testing the entire system. In this work, we complement our O/S-enabled framework DeamonGuard for on-demand (selective) SBST to support fault recovery capabilities. Towards this goal, we propose an efficient check pointing and rollback recovery mechanism which, upon fault detection, can restore the system to the most recently valid correct state and resume the normal operation assuming disabling of the faulty core, thereby leading to a healthy (but degraded) system. The work in this paper concentrates on reducing the number of stored checkpoints required when testing at a sub-core granularity, and minimizing the recovery penalty of such framework. We evaluate and demonstrate the overhead of the proposed recovery mechanism, and our results indicate a practical reduction in the number of stored checkpoints as well as a significant improvement in recovery latency for the cases where the faults are correlated with the stressed units.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IOLTS.2015.7229842","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

In-field on-line testing techniques have recently been proposed for permanent fault detection caused by wear-out/aging-related defects manifesting during the lifetime of a system. Selective Software-Based Self-Testing (SBST) is one such paradigm focusing primarily on the recently stressed functional units of a multicore system at a sub-core granularity, in an attempt to reduce the application performance penalty caused by periodically testing the entire system. In this work, we complement our O/S-enabled framework DeamonGuard for on-demand (selective) SBST to support fault recovery capabilities. Towards this goal, we propose an efficient check pointing and rollback recovery mechanism which, upon fault detection, can restore the system to the most recently valid correct state and resume the normal operation assuming disabling of the faulty core, thereby leading to a healthy (but degraded) system. The work in this paper concentrates on reducing the number of stored checkpoints required when testing at a sub-core granularity, and minimizing the recovery penalty of such framework. We evaluate and demonstrate the overhead of the proposed recovery mechanism, and our results indicate a practical reduction in the number of stored checkpoints as well as a significant improvement in recovery latency for the cases where the faults are correlated with the stressed units.
芯片多处理器中按需SBST下的有效检查点和回滚
最近提出了现场在线测试技术,用于在系统使用寿命期间出现的磨损/老化相关缺陷引起的永久性故障检测。选择性软件自测试(Selective Software-Based Self-Testing, SBST)就是这样一种范例,主要关注多核系统在子核粒度上最近强调的功能单元,试图减少周期性测试整个系统所造成的应用程序性能损失。在这项工作中,我们为按需(选择性)SBST补充了支持O/ s的框架DeamonGuard,以支持故障恢复功能。为了实现这一目标,我们提出了一种有效的检查指向和回滚恢复机制,该机制在检测到故障后,可以将系统恢复到最近有效的正确状态,并在禁用故障核心的情况下恢复正常运行,从而导致系统健康(但降级)。本文的工作集中在减少在子核心粒度测试时所需的存储检查点的数量,并最大限度地减少这种框架的恢复损失。我们评估并演示了所提出的恢复机制的开销,我们的结果表明,在故障与应力单元相关的情况下,存储检查点的数量实际减少了,恢复延迟也有了显著改善。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信