Hive: fault containment for shared-memory multiprocessors
J. Chapin, M. Rosenblum, Scott Devine, T. Lahiri, D. Teodosiu, Anoop Gupta
Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (SOSP '95), December 1995. DOI: 10.1145/224056.224059
Reliability and scalability are major concerns when designing operating systems for large-scale shared-memory multiprocessors. In this paper we describe Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This improves reliability because a hardware or software fault damages only one cell rather than the whole system, and improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. This paper focuses on Hive's solution to the following key challenges: (1) fault containment, i.e., confining the effects of hardware or software faults to the cell where they occur, and (2) memory sharing among cells, which is required to achieve application performance competitive with other multiprocessor operating systems. Fault containment in a shared-memory multiprocessor requires defending each cell against erroneous writes caused by faults in other cells. Hive prevents such damage by using the FLASH firewall, a write-permission bit-vector associated with each page of memory, and by discarding potentially corrupt pages when a fault is detected. Memory sharing is provided through a unified file and virtual memory page cache across the cells, and through a unified free page frame pool. We report early experience with the system, including the results of fault injection and performance experiments using SimOS, an accurate simulator of FLASH. The effects of faults were contained to the cell in which they occurred in all 49 tests where we injected fail-stop hardware faults, and in all 20 tests where we injected kernel data corruption. The Hive prototype executes test workloads on a four-processor four-cell system with between 0% and 11% slowdown as compared to SGI IRIX 5.2 (the version of UNIX on which it is based).
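The firewall mechanism described in the abstract lends itself to a compact illustration. The C sketch below is hypothetical, not code from the paper or the FLASH hardware: the constant NUM_PAGES and the names firewall_set, firewall_allows, and discard_suspect_pages are all illustrative assumptions. It shows the two roles a per-page write-permission bit-vector plays: rejecting remote writes from cells whose bit is not set, and identifying exactly which pages to discard when a cell is declared faulty.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGES 4096          /* assumed number of page frames, for illustration */

typedef uint64_t write_mask_t;  /* one bit per cell; supports up to 64 cells */

static write_mask_t firewall[NUM_PAGES];  /* write-permission bit-vector per page */

/* Grant or revoke a cell's write permission on a page. */
static void firewall_set(size_t page, unsigned cell, bool allow)
{
    if (allow)
        firewall[page] |= (write_mask_t)1 << cell;
    else
        firewall[page] &= ~((write_mask_t)1 << cell);
}

/* The check the hardware would apply on each remote write: a write from
 * `cell` to `page` is permitted only if that cell's bit is set. */
static bool firewall_allows(size_t page, unsigned cell)
{
    return (firewall[page] >> cell) & 1u;
}

/* After a fault in `failed_cell` is detected, every page that cell could
 * have written is potentially corrupt and must be discarded. */
static size_t discard_suspect_pages(unsigned failed_cell)
{
    size_t discarded = 0;
    for (size_t p = 0; p < NUM_PAGES; p++) {
        if (firewall_allows(p, failed_cell)) {
            firewall[p] = 0;    /* revoke all write access to the page */
            discarded++;
        }
    }
    return discarded;
}

int main(void)
{
    firewall_set(7, 2, true);   /* cell 2 is granted write access to page 7 */
    printf("cell 2 -> page 7: %s\n", firewall_allows(7, 2) ? "ok" : "denied");
    printf("cell 3 -> page 7: %s\n", firewall_allows(7, 3) ? "ok" : "denied");
    printf("pages discarded after cell 2 fails: %zu\n", discard_suspect_pages(2));
    return 0;
}

The design tradeoff this sketch makes visible: discarding every page the failed cell could have written is deliberately coarse, sacrificing some clean pages in exchange for a simple, enforceable guarantee that no corrupt data survives the fault.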