Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study

2013 IEEE International Test Conference (ITC), pp. 1-10. Pub Date: 2013-11-04. DOI: 10.1109/TEST.2013.6651907

Yanjing Li, E. Cheng, S. Makar, S. Mitra


Citations: 26

Abstract

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are:

1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact.

2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance.

3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.
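The coverage-versus-area tradeoff quoted in the abstract can be made concrete with a little arithmetic. The sketch below is not from the paper: it only restates the two reported operating points (97.5% coverage at 7.5% area, 75% coverage at 3.2% area) and the 16% example area cost of existing redundancy. The names `REPORTED_POINTS`, `coverage_per_area`, and `tradeoff` are hypothetical helpers introduced for illustration.

```python
# Figures quoted in the abstract: (self-repair coverage %, chip-level area impact %).
# The dictionary keys are illustrative labels, not terms used in the paper.
REPORTED_POINTS = {
    "full": (97.5, 7.5),
    "reduced": (75.0, 3.2),
}

# Example area cost of existing redundancy techniques, as quoted in the abstract.
REDUNDANCY_AREA = 16.0


def coverage_per_area(coverage: float, area: float) -> float:
    """Coverage percentage points obtained per percentage point of area."""
    return coverage / area


def tradeoff(high: str = "full", low: str = "reduced") -> dict:
    """Compare two reported operating points (hypothetical helper)."""
    cov_hi, area_hi = REPORTED_POINTS[high]
    cov_lo, area_lo = REPORTED_POINTS[low]
    return {
        # Coverage given up when moving to the cheaper point.
        "coverage_lost": round(cov_hi - cov_lo, 2),
        # Area recovered when moving to the cheaper point.
        "area_saved": round(area_hi - area_lo, 2),
        # Area of the full-coverage point relative to the redundancy example.
        "area_vs_redundancy": round(area_hi / REDUNDANCY_AREA, 3),
    }


if __name__ == "__main__":
    for name, (cov, area) in REPORTED_POINTS.items():
        print(name, round(coverage_per_area(cov, area), 2))
    print(tradeoff())
```

Read this way, the reduced point gives up 22.5 points of coverage to recover 4.3 points of area, and the full-coverage point costs a little under half of the 16% redundancy example. These ratios are only a reading of the abstract's numbers; the paper itself should be consulted for how the intermediate points are obtained.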

