Overview of fault handling for the chaos router
Authors: K. Bolding, L. Snyder
Published in: [Proceedings] 1991 International Workshop on Defect and Fault Tolerance on VLSI Systems, 1991-11-18
DOI: 10.1109/DFTVS.1991.199953 (https://doi.org/10.1109/DFTVS.1991.199953)
Citation count: 15
Abstract
The chaos router is an adaptive, nonminimal message router for multicomputers that is simple enough to compete with the fast, oblivious routers now in use in commercial machines. It improves on previous adaptive routers by using randomization, which eliminates the need for complex livelock protection and speeds up the router. This randomization, however, greatly complicates fault detection: because there is no worst-case bound on the time required to deliver a message, it is difficult to distinguish a lost message from a very slow one. A new method of fault detection is presented that applies not only to the chaos router but also to other adaptive routers. In addition, solutions to several practical fault diagnosis and recovery problems in the chaos router are presented. The presentation supports the claim that fault tolerance can be incorporated into a practical router without harming performance in the normal, fault-free case.
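To illustrate why the absence of a worst-case delivery bound makes lost and slow messages hard to tell apart, here is a minimal sketch under a toy model that is an assumption of this note, not the method presented in the paper: each routing step independently delivers the message with a fixed probability `p_deliver`. The function names and the per-step model are hypothetical. Under that assumption, a message is still in flight after `t` steps with probability `(1 - p_deliver) ** t`, which is never exactly zero, so any fixed timeout will occasionally flag a live message as lost.

```python
def prob_still_in_flight(p_deliver: float, steps: int) -> float:
    """Probability that a message is undelivered after `steps` routing steps.

    Assumes (hypothetically) an independent chance `p_deliver` of delivery
    at every step; this is a toy model, not the chaos router's actual behavior.
    """
    if not 0.0 < p_deliver <= 1.0:
        raise ValueError("p_deliver must be in (0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - p_deliver) ** steps


def false_alarm_rate(p_deliver: float, timeout_steps: int) -> float:
    """Chance that a pure timeout detector declares a live message lost.

    A live message is misclassified exactly when it is still in flight
    at the moment the timeout expires.
    """
    return prob_still_in_flight(p_deliver, timeout_steps)


if __name__ == "__main__":
    # Lengthening the timeout shrinks the false-alarm rate but never removes it.
    for timeout in (10, 100, 1000):
        print(timeout, false_alarm_rate(0.05, timeout))
```

In this toy model, lengthening the timeout trades detection latency for fewer false alarms, but no finite timeout drives the false-alarm rate to zero whenever `p_deliver` is below 1.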