修复:通过重新执行恢复硬错误

2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS) Pub Date : 2015-11-09 DOI:10.1109/DFT.2015.7315139

Jyothish Soman, Negar Miralaei, A. Mycroft, Timothy M. Jones

{"title":"修复:通过重新执行恢复硬错误","authors":"Jyothish Soman, Negar Miralaei, A. Mycroft, Timothy M. Jones","doi":"10.1109/DFT.2015.7315139","DOIUrl":null,"url":null,"abstract":"Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level-a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68× of a fully functioning system.","PeriodicalId":383972,"journal":{"name":"2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)","volume":"167 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"REPAIR: Hard-error recovery via re-execution\",\"authors\":\"Jyothish Soman, Negar Miralaei, A. Mycroft, Timothy M. Jones\",\"doi\":\"10.1109/DFT.2015.7315139\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level-a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68× of a fully functioning system.\",\"PeriodicalId\":383972,\"journal\":{\"name\":\"2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)\",\"volume\":\"167 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DFT.2015.7315139\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DFT.2015.7315139","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

在即将到来的技术节点上，处理器的可靠性给设计人员带来了巨大的挑战，包括制造变异性、参数变化和晶体管损耗导致的永久性故障。我们提出了一种在微架构层面上容忍这种影响的设计——一种带有n个核以及一个或多个共享指令重执行单元(iru)的芯片。使用故障组件的指令被识别并在IRU上重新执行。这种设计在没有错误的情况下不会导致减速，并且允许在我们方案保护的结构中的一个或所有核心发生多次硬错误后继续运行所有n个核心。实验表明，单核芯片在出现1个错误时仅会出现23%的减速，而在出现5个错误时则会上升到43%。在4核情况下，每个核上有4个错误和共享IRU, REPAIR使性能达到全功能系统的0.68倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

REPAIR: Hard-error recovery via re-execution

Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level-a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68× of a fully functioning system.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)

自引率

0.00%

发文量