Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

Qiaoyan Yu, José Cano, J. Flich, P. Ampadu
{"title":"Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip","authors":"Qiaoyan Yu, José Cano, J. Flich, P. Ampadu","doi":"10.1109/NOCS.2012.27","DOIUrl":null,"url":null,"abstract":"High-end MPSoC systems with built-in high-radix topologies achieve good performance because of the improved connectivity and the reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than routers using simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to discover the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of links failures at short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption and the residual error rate by 33% and up to two orders of magnitude, respectively, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism has been examined in this work, as well.","PeriodicalId":6333,"journal":{"name":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","volume":"35 1","pages":"169-176"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NOCS.2012.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

High-end MPSoC systems with built-in high-radix topologies achieve good performance because of the improved connectivity and the reduced network diameter. In high-end MPSoC systems, fault tolerance support is becoming a compulsory feature. In this work, we propose a combined method to address permanent and transient link and router failures in those systems. The LBDRhr mechanism is proposed to tolerate permanent link failures in some popular high-radix topologies. The increased router complexity may lead to more transient router errors than routers using simple XY routing algorithm. We exploit the inherent information redundancy (IIR) in LBDRhr logic to manage transient errors in the network routers. Thorough analyses are provided to discover the appropriate internal nodes and the forbidden signal patterns for transient error detection. Simulation results show that LBDRhr logic can tolerate all of the permanent failure combinations of long-range links and 80% of links failures at short-range links. Case studies show that the error detection method based on the new IIR extraction method reduces the power consumption and the residual error rate by 33% and up to two orders of magnitude, respectively, compared to triple modular redundancy. The impact of network topologies on the efficiency of the detection mechanism has been examined in this work, as well.
高端多处理器片上系统的瞬态和永久错误控制
具有内置高基数拓扑的高端MPSoC系统由于连接性的改善和网络直径的减小而获得了良好的性能。在高端MPSoC系统中,容错支持正成为一项强制性功能。在这项工作中,我们提出了一种组合方法来解决这些系统中的永久和瞬态链路和路由器故障。提出了LBDRhr机制来容忍一些流行的高基数拓扑中的永久链路故障。与使用简单XY路由算法的路由器相比,路由器复杂性的增加可能会导致更多的路由器瞬态错误。我们利用LBDRhr逻辑中的固有信息冗余(IIR)来管理网络路由器中的瞬态错误。通过深入的分析,找出合适的内部节点和用于瞬态错误检测的禁止信号模式。仿真结果表明,LBDRhr逻辑可以容忍远程链路的所有永久故障组合和80%的短程链路故障组合。实例研究表明,与三模冗余相比,基于新的IIR提取方法的误差检测方法的功耗和剩余错误率分别降低了33%和两个数量级。网络拓扑结构对检测机制效率的影响也在这项工作中进行了研究。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信