Scalable Fault-Tolerant Distributed Shared Memory

ACM/IEEE SC 2000 Conference (SC'00) | Pub Date: 2000-11-01 | DOI: 10.1109/SC.2000.10014

F. Sultan, Thu D. Nguyen, L. Iftode


Citations: 48

Abstract



This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault tolerance support is lightweight, adding only low messaging, logging, and checkpointing overheads, and that our management algorithms can be expected to effectively bound the size of the checkpoints and logs for real applications.
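To make the interplay between independent checkpointing and log size concrete, the following is a minimal Python sketch. It is a generic toy model and not the paper's HLRC protocol: the `Node` class, its logical clock, and the `trim_log` and `recover_from` methods are all hypothetical names introduced here. Each node logs the updates it sends in volatile memory, a peer checkpoints on its own schedule, and the sender discards log entries once it learns that the peer's checkpoint covers them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node that logs outgoing updates in volatile memory (hypothetical model)."""
    name: str
    clock: int = 0  # local logical time, advanced on every send
    state: dict = field(default_factory=dict)
    checkpoint: dict = field(default_factory=dict)
    # received[sender] = highest sender clock value applied to our state
    received: dict = field(default_factory=dict)
    # covered[sender] = value of received[sender] at our last checkpoint
    covered: dict = field(default_factory=dict)
    # log[receiver] = list of (clock, key, value) updates kept for replay
    log: dict = field(default_factory=dict)

    def send_update(self, peer, key, value):
        """Log the update locally, then deliver it to the peer."""
        self.clock += 1
        self.log.setdefault(peer.name, []).append((self.clock, key, value))
        peer.apply(self.name, self.clock, key, value)

    def apply(self, sender, clock, key, value):
        self.state[key] = value
        self.received[sender] = clock

    def take_checkpoint(self):
        """Checkpoint without coordinating; return what the checkpoint covers."""
        self.checkpoint = dict(self.state)
        self.covered = dict(self.received)
        return dict(self.covered)

    def trim_log(self, peer_name, covered_upto):
        """Drop logged updates that the peer's checkpoint already contains."""
        entries = self.log.get(peer_name, [])
        self.log[peer_name] = [e for e in entries if e[0] > covered_upto]

    def recover_from(self, sender):
        """Restore the last checkpoint, then replay newer updates from the sender's log."""
        self.state = dict(self.checkpoint)
        self.received = dict(self.covered)
        for clock, key, value in sender.log.get(self.name, []):
            if clock > self.received.get(sender.name, 0):
                self.apply(sender.name, clock, key, value)


def demo():
    a, b = Node("a"), Node("b")
    a.send_update(b, "x", 1)
    a.send_update(b, "y", 2)
    covered = b.take_checkpoint()         # b checkpoints on its own schedule
    a.trim_log("b", covered.get("a", 0))  # a trims once it learns of the checkpoint
    a.send_update(b, "x", 3)
    b.state, b.received = {}, {}          # simulate a crash of b
    b.recover_from(a)
    return b.state, a.log["b"]
```

In this sketch the log held by `a` shrinks only when `b` reports a checkpoint, which illustrates why log and checkpoint sizes need explicit management when nodes never synchronize globally. A real system would also have to handle lost messages, multiple senders, and failure of the logging node itself, none of which is modeled here.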
