REPAIR: Hard-error recovery via re-execution

2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS) Pub Date : 2015-11-09 DOI:10.1109/DFT.2015.7315139

Jyothish Soman, Negar Miralaei, A. Mycroft, Timothy M. Jones

Citations: 2

Abstract

Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions that use a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows all n cores to continue operating after multiple hard errors on one or all cores, provided the errors occur in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves 0.68× the performance of a fully functioning system.
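The routing idea in the abstract (identify instructions that use a faulty component and send them to an IRU) can be illustrated with a toy model. This is only a sketch under assumptions that are not from the paper: the `Core` class, the `dispatch` function, the unit names (`"alu"`, `"mul"`, `"lsu"`), and the representation of an instruction as an `(opcode, unit)` pair are all hypothetical. It says nothing about how REPAIR detects faults or how the IRU is implemented in hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical core model: an id plus the set of units known to be faulty."""
    core_id: int
    faulty_units: set = field(default_factory=set)


def dispatch(core, instructions):
    """Split an instruction stream into work that stays on the core and
    work flagged for re-execution on a shared IRU.

    Each instruction is an (opcode, unit) pair, where `unit` names the
    component the instruction would use (an assumption of this sketch).
    """
    local, to_iru = [], []
    for opcode, unit in instructions:
        if unit in core.faulty_units:
            to_iru.append(opcode)   # uses a faulty component: send to the IRU
        else:
            local.append(opcode)    # unaffected: runs on the core as usual
    return local, to_iru


# Example: a core whose (hypothetical) multiplier has a hard error.
core = Core(core_id=0, faulty_units={"mul"})
stream = [("add", "alu"), ("mul", "mul"), ("load", "lsu"), ("mul", "mul")]
local, to_iru = dispatch(core, stream)
```

In this toy model a core with an empty `faulty_units` set sends nothing to the IRU, which mirrors the abstract's claim that the design incurs no slowdown in the absence of errors.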



