Fault tolerance for an embedded wormhole switched network
R. Hotchkiss, B. C. O'Neill, S. Clark
Proceedings International Conference on Parallel Computing in Electrical Engineering. PARELEC 2000
Published: 2000-08-27
DOI: https://doi.org/10.1109/PCEE.2000.873606
Citation count: 3
Abstract
The effectiveness of parallel and distributed systems depends heavily on the reliability and efficiency of the method used for information transfer. To satisfy these requirements, the communication medium must provide fault tolerance throughout the communication layers while minimising operational overheads. The work described concerns a scalable communication system for a distributed-memory parallel processing architecture, constructed from message routing switches. The system employs a hardware mechanism local to each physical connection, which provides a distributed solution for fault detection and isolation. By isolating faults and using adaptive routing algorithms, networks can be designed that remain operable in the presence of faults. The basic switch and the fault isolation mechanism are explained. The paper concludes with implementation details of the operational hardware and of the environment in which it has been tested.
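The idea that isolating faulty links and routing adaptively keeps a network operable can be illustrated with a small software sketch. This is not the paper's mechanism, which is implemented in hardware and is local to each physical connection; the sketch below assumes a simple undirected topology and uses a breadth-first search as a stand-in for an adaptive routing algorithm. The names `Network`, `isolate`, and `route` are hypothetical and chosen only for illustration.

```python
from collections import deque


class Network:
    """Toy model of a switched network with per-link fault isolation.

    Assumption: links are undirected and identified by their two endpoints.
    """

    def __init__(self, links):
        self.adj = {}          # node -> set of neighbouring nodes
        self.isolated = set()  # links currently marked as faulty
        for a, b in links:
            self.adj.setdefault(a, set()).add(b)
            self.adj.setdefault(b, set()).add(a)

    def isolate(self, a, b):
        """Mark the link between a and b as faulty so routing avoids it."""
        self.isolated.add(frozenset((a, b)))

    def usable(self, a, b):
        """Return True if the link between a and b has not been isolated."""
        return frozenset((a, b)) not in self.isolated

    def route(self, src, dst):
        """Return a shortest path over usable links, or None if unreachable.

        Breadth-first search is a stand-in here; a real switch would decide
        hop by hop from local information.
        """
        parent = {src: None}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.adj.get(node, ())):
                if nxt not in parent and self.usable(node, nxt):
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Four switches connected in a ring: 0-1, 1-3, 3-2, 2-0.
net = Network([(0, 1), (1, 3), (3, 2), (2, 0)])
before = net.route(0, 3)   # path through switch 1
net.isolate(0, 1)          # a fault is detected on link 0-1 and isolated
after = net.route(0, 3)    # traffic is redirected through switch 2
print(before, after)
```

In this ring, isolating one link still leaves an alternative path, which is the property the abstract attributes to networks designed with redundancy. Isolating a second link on the remaining path would make `route` return `None`.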