P. Banerjee, J. T. Rahmeh, C. Stunkel, S. Nair, K. Roy, J. Abraham
{"title":"Intel超立方体多处理器的系统级容错性评估","authors":"P. Banerjee, J. T. Rahmeh, C. Stunkel, S. Nair, K. Roy, J. Abraham","doi":"10.1109/FTCS.1988.5344","DOIUrl":null,"url":null,"abstract":"A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report on the results of two applications: matrix multiplication and fast Fourier transform. They have performed extensive studies of fault coverage of their system-level fault-detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. They propose a reconfiguration strategy for reconfiguring the system around faulty processors by introducing spare links and nodes.<<ETX>>","PeriodicalId":171148,"journal":{"name":"[1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1988-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"An evaluation of system-level fault tolerance on the Intel hypercube multiprocessor\",\"authors\":\"P. Banerjee, J. T. Rahmeh, C. Stunkel, S. Nair, K. Roy, J. Abraham\",\"doi\":\"10.1109/FTCS.1988.5344\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report on the results of two applications: matrix multiplication and fast Fourier transform. They have performed extensive studies of fault coverage of their system-level fault-detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. They propose a reconfiguration strategy for reconfiguring the system around faulty processors by introducing spare links and nodes.<<ETX>>\",\"PeriodicalId\":171148,\"journal\":{\"name\":\"[1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1988-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"[1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FTCS.1988.5344\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"[1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FTCS.1988.5344","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An evaluation of system-level fault tolerance on the Intel hypercube multiprocessor
A discussion is presented of a fault-tolerant hypercube multiprocessor architecture which uses a novel algorithm-based fault-detection approach for identifying faulty processors. The scheme involves the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube. The authors have implemented system-level fault-detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. They report on the results of two applications: matrix multiplication and fast Fourier transform. They have performed extensive studies of fault coverage of their system-level fault-detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. They propose a reconfiguration strategy for reconfiguring the system around faulty processors by introducing spare links and nodes.<>