Hardware Fault Containment In Scalable Shared-memory Multiprocessors

Conference Proceedings. The 24th Annual International Symposium on Computer Architecture. Pub Date: 1997-06-01. DOI: 10.1145/264107.264141

D. Teodosiu, J. Baxter, Kinshuk Govil, J. Chapin, M. Rosenblum, M. Horowitz


Citations: 45

Abstract

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
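The claim that reliability drops as machines grow can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the paper: it assumes independent node faults with a hypothetical per-node fault probability `p_node`, and it idealizes containment as "an application is lost only if one of the nodes it uses faults". The function names and the numbers are illustrative assumptions, and the model ignores recovery time and recovery failures.

```python
def p_loss_without_containment(n_nodes: int, p_node: float) -> float:
    """Probability that an application is lost when a fault in ANY of the
    n_nodes nodes brings down the whole machine (independent faults assumed)."""
    return 1.0 - (1.0 - p_node) ** n_nodes


def p_loss_with_containment(nodes_used: int, p_node: float) -> float:
    """Probability that an application is lost when only faults in the
    nodes it actually uses can affect it (idealized containment)."""
    return 1.0 - (1.0 - p_node) ** nodes_used


if __name__ == "__main__":
    p = 0.001        # assumed per-node fault probability over some interval
    nodes_used = 4   # assumed number of nodes a single application depends on
    for n in (8, 32, 128):
        without = p_loss_without_containment(n, p)
        contained = p_loss_with_containment(nodes_used, p)
        print(f"{n:4d} nodes: no containment {without:.4f}, contained {contained:.4f}")
```

Under these assumptions the loss probability without containment grows with machine size, while the contained figure depends only on the portion of the machine the application uses, which is the intuition behind limiting fault damage to part of the machine.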



