The performance of consistent checkpointing in distributed shared memory systems
G. Cabillic, Gilles Muller, I. Puaut
Proceedings. 14th Symposium on Reliable Distributed Systems, 1995-09-13
DOI: 10.1109/RELDIS.1995.526217 (https://doi.org/10.1109/RELDIS.1995.526217)
This paper presents the design and implementation of a consistent checkpointing scheme for distributed shared memory (DSM) systems. Our approach integrates checkpoints into the synchronization barriers that already exist in applications, which avoids the need for an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance degrades only while a checkpoint is being taken; hence, the programmer can tune the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the time between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.
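The idea of taking checkpoints inside existing barriers can be illustrated with a small toy model. The sketch below is not the paper's Intel Paragon implementation: it assumes Python threads stand in for DSM nodes, a plain list stands in for shared pages, and a second list stands in for stable storage. The names `BarrierCheckpointer`, `write`, `rollback`, and `run` are hypothetical. It models only a blocking, incremental variant (only pages dirtied since the last checkpoint are saved); the non-blocking and pre-flushing variants are not modeled.

```python
import threading


class BarrierCheckpointer:
    """Toy model: a checkpoint is taken inside the application's own barrier."""

    def __init__(self, n_workers, n_pages, every):
        self.pages = [0] * n_pages        # stand-in for DSM pages
        self.stable = list(self.pages)    # stand-in for stable storage
        self.dirty = set()                # pages written since last checkpoint
        self.every = every                # checkpoint every `every` barriers
        self.barrier_count = 0
        self.pages_saved = []             # pages written out per checkpoint
        self.lock = threading.Lock()
        # The action runs in exactly one thread, after all workers have
        # arrived and before any is released, so the state it sees is
        # consistent without any extra synchronization.
        self.barrier = threading.Barrier(n_workers, action=self._on_barrier)

    def write(self, page, value):
        with self.lock:
            self.pages[page] = value
            self.dirty.add(page)

    def _on_barrier(self):
        self.barrier_count += 1
        if self.barrier_count % self.every == 0:
            # Incremental checkpoint: copy only the dirty pages.
            for page in self.dirty:
                self.stable[page] = self.pages[page]
            self.pages_saved.append(len(self.dirty))
            self.dirty.clear()

    def rollback(self):
        # Recovery: restore the last consistent checkpoint.
        self.pages = list(self.stable)
        self.dirty.clear()


def run(n_workers=4, n_steps=6, every=3, n_pages=8):
    ckpt = BarrierCheckpointer(n_workers, n_pages, every)

    def worker(rank):
        for step in range(1, n_steps + 1):
            ckpt.write(rank, step)   # each worker updates only its own page
            ckpt.barrier.wait()      # barrier already present in the app

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ckpt


if __name__ == "__main__":
    ckpt = run()
    print("pages saved per checkpoint:", ckpt.pages_saved)
    print("stable storage:", ckpt.stable)
```

Raising `every` in this model means fewer checkpoints but more lost work after `rollback`, which mirrors the trade-off the abstract describes between checkpointing cost and rollback length.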