[2009] A Stage-Level Recovery Scheme in Scalable Pipeline Modules for High Dependability

2010 International Workshop on Innovative Architecture for Future Generation High Performance. Pub Date: 2010-01-17. DOI: 10.1109/IWIA.2010.11

Jun Yao, Hajime Shimada, Kazutoshi Kobayashi


Citations: 10

Abstract

In recent years, rising error rates have become one of the major impediments to applying new process technologies in electronic devices such as microprocessors. This necessitates research into fault-tolerance mechanisms at the device, micro-architecture, and system levels, so that future microprocessors continue to compute correctly as process technologies advance.

Space redundancy, in the form of dual or triple modular redundancy (DMR or TMR), is widely used to tolerate errors with negligible performance loss. In this paper, at the micro-architecture level, we propose a very fine-grained recovery scheme based on a DMR processor architecture that covers every transient error inside the memory interface boundary. Our recovery method makes full use of the hardware already duplicated in the DMR processor, and so avoids the large hardware extension required by the checkpoint buffers used in many fault-tolerant processors. The hardware-based recovery dynamically triggers an instruction re-execution procedure in the cycle after an error is detected, so error-free execution is achieved with near-zero performance impact.

A TMR architecture is usually preferred because its majority-voting logic corrects errors seamlessly and therefore incurs no recovery delay. With our fast recovery scheme at a low hardware cost, our results show that even under a relatively high transient error rate, a DMR architecture alone can detect and recover from errors at negligible performance cost. Our reliable processor is therefore built to use DMR execution with fast recovery as its main working mode. It saves around one third of the energy consumed by a traditional TMR architecture while maintaining transient error coverage.
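The contrast the abstract draws between TMR voting and DMR detect-and-re-execute can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design. The names `tmr_vote`, `dmr_execute`, `dmr_energy_vs_tmr`, and the fault-mask schedule are hypothetical. An "attempt" here is a modelling unit, not the paper's cycle count. The energy estimate assumes energy scales with the number of active redundant copies, which is a simplification.

```python
MASK32 = 0xFFFFFFFF  # model results as 32-bit values (an assumption)


def tmr_vote(a, b, c):
    """Bitwise majority vote over three redundant results (TMR)."""
    return (a & b) | (a & c) | (b & c)


def dmr_execute(op, x, y, fault_masks):
    """Run two redundant copies, compare, and re-execute on a mismatch (DMR).

    fault_masks is an iterable of (mask_a, mask_b) pairs, one pair per
    attempt. Each mask is XORed into one copy's result to model a
    transient bit flip; a mask of 0 means no fault.
    Returns (result, attempts).
    """
    for attempts, (mask_a, mask_b) in enumerate(fault_masks, start=1):
        copy_a = (op(x, y) & MASK32) ^ mask_a
        copy_b = (op(x, y) & MASK32) ^ mask_b
        if copy_a == copy_b:
            # The copies agree, so the result is accepted.
            return copy_a, attempts
        # Mismatch: an error was detected, so the loop re-executes.
    raise RuntimeError("fault schedule exhausted before the copies agreed")


def dmr_energy_vs_tmr(mean_attempts=1.0):
    """Rough DMR/TMR energy ratio: two copies per attempt versus three."""
    return 2.0 * mean_attempts / 3.0


def add(a, b):
    return a + b


# One transient bit flip hits copy A on the first attempt; the retry is clean.
result, attempts = dmr_execute(add, 2, 3, [(0b100, 0), (0, 0)])

# TMR masks the same single-copy fault without any retry.
voted = tmr_vote(5 ^ 0b100, 5, 5)
```

In this model the DMR path pays one extra attempt only when a fault actually occurs, while the fault-free ratio of 2/3 matches the roughly one-third energy saving stated in the abstract. The comparison also shows a limit of plain DMR: it detects a fault only when the two copies disagree.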


