Towards scalable reliability frameworks for error prone CMPs

International Conference on Compilers, Architecture, and Synthesis for Embedded Systems. Pub Date: 2009-10-11. DOI: 10.1145/1629395.1629432

Joseph Sloan, Rakesh Kumar


Citations: 9

Abstract

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates would be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router utilizes the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in voting to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance/bandwidth benefits from these optimizations.
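
As a rough illustration of the voting idea the abstract refers to, the sketch below shows plain majority voting over N replica outputs, an early-decision variant that stops once a strict majority agrees, and the textbook probability that a majority of replicas is error-free. It is not the paper's router design: the function names, the early-decision rule, and the assumption of independent per-replica errors are all hypothetical choices made for this example.

```python
from collections import Counter
from math import comb


def majority_vote(votes):
    """Return the value reported by a strict majority of replicas, or None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) // 2 else None


def early_majority(vote_stream, n):
    """Decide as soon as one value holds a strict majority of the n expected votes.

    Returns (value, votes_consumed); value is None if no majority emerges.
    This early exit is an illustrative assumption, not the paper's mechanism.
    """
    needed = n // 2 + 1
    counts = Counter()
    seen = 0
    for vote in vote_stream:
        seen += 1
        counts[vote] += 1
        if counts[vote] >= needed:
            return vote, seen
    return None, seen


def majority_correct_probability(n, p):
    """Probability that a strict majority of n replicas is error-free,
    assuming each replica errs independently with probability p."""
    needed = n // 2 + 1
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
               for k in range(needed, n + 1))


if __name__ == "__main__":
    print(majority_vote([7, 7, 3]))
    print(early_majority(iter([5, 5, 5, 9, 5]), 5))
    for n in (3, 5, 7, 9):
        print(n, majority_correct_probability(n, 0.2))
```

Under the independence assumption, the last loop shows why higher error rates call for larger N, which in turn increases the number of vote messages that must be collected and compared.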

