任务池的选择性增量备份方案

2018 International Conference on High Performance Computing & Simulation (HPCS) Pub Date : 2018-07-01 DOI:10.1109/HPCS.2018.00103

Claudia Fohry, Jonas Posner, Lukas Reitz

{"title":"任务池的选择性增量备份方案","authors":"Claudia Fohry, Jonas Posner, Lukas Reitz","doi":"10.1109/HPCS.2018.00103","DOIUrl":null,"url":null,"abstract":"Checkpointing is a common approach to prevent loss of a program's state after permanent node failures. When it is performed on application-level, less data need to be saved. This paper suggests an uncoordinated application-level checkpointing technique for task pools. It selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before. The checkpoints are held in a resilient in-memory data store. Our technique applies to any task pool variant in which workers operate at the top of local pools, and work stealing operates at the bottom. Furthermore, the tasks must be free of side effects, and the final result must be calculated by reduction from individual task results. We implemented the technique for the lifeline-based global load balancing variant of task pools. This variant couples random victim selection with an overlay graph for termination detection. A fault-tolerant realization already exists in the form of a Java library, called JFT_GLB. It uses the APGAS and Hazelcast libraries underneath. Our implementation modifies JFT_GLB by replacing its nonselective checkpointing scheme with our new one. In experiments, we compared the overhead of the new scheme to that of JFT_GLB, with UTS, BC and two synthetic benchmarks. The new scheme required slightly more running time when local pools were small, and paid off otherwise.","PeriodicalId":308138,"journal":{"name":"2018 International Conference on High Performance Computing & Simulation (HPCS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Selective and Incremental Backup Scheme for Task Pools\",\"authors\":\"Claudia Fohry, Jonas Posner, Lukas Reitz\",\"doi\":\"10.1109/HPCS.2018.00103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Checkpointing is a common approach to prevent loss of a program's state after permanent node failures. When it is performed on application-level, less data need to be saved. This paper suggests an uncoordinated application-level checkpointing technique for task pools. It selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before. The checkpoints are held in a resilient in-memory data store. Our technique applies to any task pool variant in which workers operate at the top of local pools, and work stealing operates at the bottom. Furthermore, the tasks must be free of side effects, and the final result must be calculated by reduction from individual task results. We implemented the technique for the lifeline-based global load balancing variant of task pools. This variant couples random victim selection with an overlay graph for termination detection. A fault-tolerant realization already exists in the form of a Java library, called JFT_GLB. It uses the APGAS and Hazelcast libraries underneath. Our implementation modifies JFT_GLB by replacing its nonselective checkpointing scheme with our new one. In experiments, we compared the overhead of the new scheme to that of JFT_GLB, with UTS, BC and two synthetic benchmarks. The new scheme required slightly more running time when local pools were small, and paid off otherwise.\",\"PeriodicalId\":308138,\"journal\":{\"name\":\"2018 International Conference on High Performance Computing & Simulation (HPCS)\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 International Conference on High Performance Computing & Simulation (HPCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCS.2018.00103\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS.2018.00103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

检查点是防止在永久性节点故障后程序状态丢失的常用方法。在应用程序级别执行时，需要保存的数据较少。提出了一种针对任务池的非协调应用层检查点技术。它有选择地、增量地只保存那些在一段时间内停留在池中并且以前没有保存过的任务。检查点保存在一个有弹性的内存数据存储中。我们的技术适用于任何任务池变体，其中工人在本地池的顶部操作，而工作窃取在底部操作。此外，任务必须没有副作用，最终结果必须通过对单个任务结果的简化来计算。我们为基于生命线的任务池的全局负载平衡变体实现了该技术。该变体将随机受害者选择与覆盖图相结合，用于终止检测。一种容错实现已经以Java库的形式存在，称为JFT_GLB。它使用下面的APGAS和Hazelcast库。我们的实现通过将JFT_GLB的非选择性检查点方案替换为我们的新方案来修改JFT_GLB。在实验中，我们将新方案的开销与JFT_GLB、UTS、BC和两个合成基准的开销进行了比较。当本地池很小时，新方案需要更多的运行时间，否则就会得到回报。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Selective and Incremental Backup Scheme for Task Pools

Checkpointing is a common approach to prevent loss of a program's state after permanent node failures. When it is performed on application-level, less data need to be saved. This paper suggests an uncoordinated application-level checkpointing technique for task pools. It selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before. The checkpoints are held in a resilient in-memory data store. Our technique applies to any task pool variant in which workers operate at the top of local pools, and work stealing operates at the bottom. Furthermore, the tasks must be free of side effects, and the final result must be calculated by reduction from individual task results. We implemented the technique for the lifeline-based global load balancing variant of task pools. This variant couples random victim selection with an overlay graph for termination detection. A fault-tolerant realization already exists in the form of a Java library, called JFT_GLB. It uses the APGAS and Hazelcast libraries underneath. Our implementation modifies JFT_GLB by replacing its nonselective checkpointing scheme with our new one. In experiments, we compared the overhead of the new scheme to that of JFT_GLB, with UTS, BC and two synthetic benchmarks. The new scheme required slightly more running time when local pools were small, and paid off otherwise.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 International Conference on High Performance Computing & Simulation (HPCS)

自引率

0.00%

发文量