Fault-Tolerant Global Load Balancing in X10

2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Pub Date: 2014-09-01. DOI: 10.1109/SYNASC.2014.69

Marco Bungart, Claudia Fohry, Jonas Posner


Citations: 2

Abstract

To be effective, scalability requires fault tolerance. We consider a user-level fault tolerance technique for coping with permanent node failures. The technique is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages: in Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which irregular applications often use to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the X10 standard library. We ran experiments with the Unbalanced Tree Search (UTS) and Betweenness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was negligible.
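The snapshot idea in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the authors' algorithm and not X10 or GLB code: it is written in Python, and every name in it (`Worker`, `write_snapshot`, `steal`, `fail`, `run`) is hypothetical. It assumes that each worker's snapshot is held by the next live worker on a ring, that a steal updates the snapshots of both victim and thief in one atomic step, that at most one place fails, and that there are at least two workers. The tasks count the nodes of a small irregular tree (loosely in the spirit of UTS), and the reduction is a sum.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 8
ROOT = (0, 1)  # a task is a tree node: (depth, label)

def children(task):
    """Side-effect-free task body: return the child tasks of one tree node."""
    depth, label = task
    if depth >= MAX_DEPTH:
        return []
    h = (label * 2654435761 + depth * 40503) & 0xFFFFFFFF
    n = 1 + (h >> 13) % 3  # between 1 and 3 children, irregular across nodes
    return [(depth + 1, label * 3 + k + 1) for k in range(n)]

def count_sequential(root=ROOT):
    """Reference result: number of tree nodes, computed without any pool."""
    stack, count = [root], 0
    while stack:
        count += 1
        stack.extend(children(stack.pop()))
    return count

@dataclass
class Worker:  # hypothetical model of one worker per place
    pid: int
    pool: list = field(default_factory=list)     # private task pool
    result: int = 0                              # partial reduction value
    alive: bool = True
    backups: dict = field(default_factory=dict)  # snapshots held for others
    since_snapshot: int = 0                      # tasks processed since last snapshot

def backup_of(workers, pid):
    """Next live worker after pid on the ring (assumed placement), or None."""
    n = len(workers)
    for off in range(1, n):
        cand = workers[(pid + off) % n]
        if cand.alive:
            return cand
    return None

def write_snapshot(workers, w):
    """Copy w's pool and partial result to its backup holder."""
    holder = backup_of(workers, w.pid)
    if holder is not None:
        holder.backups[w.pid] = (list(w.pool), w.result)
    w.since_snapshot = 0

def steal(workers, thief):
    """Move half of one victim's tasks to the thief, then refresh both snapshots."""
    for victim in workers:
        if victim.alive and victim is not thief and len(victim.pool) >= 2:
            k = len(victim.pool) // 2
            thief.pool.extend(victim.pool[:k])
            del victim.pool[:k]
            write_snapshot(workers, victim)  # stolen tasks must not stay in the
            write_snapshot(workers, thief)   # victim's snapshot, or they run twice
            return

def fail(workers, pid):
    """Permanent failure: local state is lost, the backup holder adopts the snapshot."""
    dead = workers[pid]
    dead.alive = False
    dead.pool, dead.result, dead.backups = [], 0, {}
    heir = backup_of(workers, pid)
    pool, result = heir.backups.pop(pid)
    heir.pool.extend(pool)
    heir.result += result
    for w in workers:  # snapshots held by the dead worker are gone: rewrite all
        if w.alive:
            write_snapshot(workers, w)

def run(n_workers, fail_at=None, fail_pid=None, snapshot_every=4):
    """Simulate rounds; optionally kill worker fail_pid after round fail_at."""
    workers = [Worker(i) for i in range(n_workers)]
    workers[0].pool.append(ROOT)
    for w in workers:
        write_snapshot(workers, w)
    rounds = 0
    while any(w.pool for w in workers if w.alive):
        for w in workers:
            if not w.alive:
                continue
            if not w.pool:
                steal(workers, w)
            if w.pool:
                w.result += 1  # process one task: count the node
                w.pool.extend(children(w.pool.pop()))
                w.since_snapshot += 1
                if w.since_snapshot >= snapshot_every:
                    write_snapshot(workers, w)
        rounds += 1
        if rounds == fail_at:
            fail(workers, fail_pid)
    live = [w for w in workers if w.alive]
    return sum(w.result for w in live), len(live)

if __name__ == "__main__":
    print("sequential:", count_sequential())
    print("with one failure:", run(4, fail_at=2, fail_pid=1))
```

In this model, a failed worker's progress since its last snapshot is simply redone by the worker that adopts the snapshot. That is safe only because tasks are free of side effects and because every steal refreshes the snapshots, so no task can appear both in a restored snapshot and in a live pool. A real distributed implementation also has to handle failures that occur in the middle of a steal or a snapshot write, which this sequential sketch does not model.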
