A recoverable distributed shared memory integrating coherence and recoverability

Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers. Pub Date: 1995-06-27. DOI: 10.1109/FTCS.1995.466970

Anne-Marie Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, I. Puaut

Citations: 80

Abstract

A recoverable distributed shared memory integrating coherence and recoverability

Large-scale distributed systems are attractive for running parallel applications that require substantial computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism that relies on a recoverable distributed shared memory (DSM) to tolerate single-node failures. Whereas most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication that a DSM already provides to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
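
The abstract does not spell out the protocol, so the following is only a toy sketch of the general idea that replicas already created by a coherence protocol can double as recovery copies. It assumes, purely for illustration, that a checkpoint needs two copies of each modified page on distinct nodes. The names `Page`, `checkpoint`, and `survives`, the `dirty` flag, and the choice of backup node are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A shared page in a toy DSM (all fields are illustrative assumptions)."""
    owner: int                                  # node holding the current copy
    replicas: set = field(default_factory=set)  # other nodes holding read copies
    dirty: bool = True                          # modified since the last checkpoint


def checkpoint(pages, num_nodes):
    """Place two recovery copies of every dirty page on distinct nodes.

    Assumed rule, not the paper's protocol: if a replica already exists on
    another node, reuse it as the second recovery copy; otherwise transfer
    the page to a neighbouring node. Returns (placement, transfer_count).
    """
    assert num_nodes >= 2, "two distinct nodes are needed per page"
    placement = {}
    transfers = 0
    for page_id, page in pages.items():
        if not page.dirty:
            continue  # copies from the previous checkpoint are still valid
        existing = sorted(page.replicas - {page.owner})
        if existing:
            backup = existing[0]            # replica reused, no transfer
        else:
            backup = (page.owner + 1) % num_nodes
            page.replicas.add(backup)       # page shipped to the backup node
            transfers += 1
        placement[page_id] = (page.owner, backup)
        page.dirty = False
    return placement, transfers


def survives(placement, failed_node):
    """True if every checkpointed page keeps a copy off the failed node."""
    return all(any(n != failed_node for n in copies)
               for copies in placement.values())


pages = {
    "a": Page(owner=0, replicas={1, 2}),            # already replicated
    "b": Page(owner=1),                             # single copy only
    "c": Page(owner=2, replicas={0}, dirty=False),  # unchanged page
}
placement, transfers = checkpoint(pages, num_nodes=4)
print(transfers)  # 1
```

In this toy run only page `b` has to be copied: page `a` reuses an existing replica and page `c` is unchanged, which is the kind of saving the abstract attributes to exploiting DSM replication. Because each checkpointed page ends up on two different nodes, `survives(placement, n)` holds for any single failed node `n`.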
