On Precise Fault Localization and Identification in NoC Architectures
M. Stáva
2019 22nd Euromicro Conference on Digital System Design (DSD), 2019-08-01
DOI: 10.1109/DSD.2019.00075 (https://doi.org/10.1109/DSD.2019.00075)
Citations: 1
Abstract
This paper presents a novel online fault-tolerance method for networks-on-chip (NoCs) based on precise fault localization and identification. We introduce three concepts: distinguishing between intra-switch path faults; a retransmission credit, used to distinguish permanent faults from transient ones; and a long transient recovery timeout, used to distinguish short transient faults from long transient faults (or bursts of transient faults). Errors are also monitored separately on links and on switches. Together, these concepts yield higher NoC performance than existing error recovery schemes. Experimental results show the performance and resource utilization of the proposed NoC error recovery scheme.