{"title":"X10中的容错全局负载均衡","authors":"Marco Bungart, Claudia Fohry, Jonas Posner","doi":"10.1109/SYNASC.2014.69","DOIUrl":null,"url":null,"abstract":"Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Between ness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was neglectable.","PeriodicalId":150575,"journal":{"name":"2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Fault-Tolerant Global Load Balancing in X10\",\"authors\":\"Marco Bungart, Claudia Fohry, Jonas Posner\",\"doi\":\"10.1109/SYNASC.2014.69\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Between ness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was neglectable.\",\"PeriodicalId\":150575,\"journal\":{\"name\":\"2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SYNASC.2014.69\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SYNASC.2014.69","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Scalability postulates fault tolerance to be effective. We consider a user-level fault tolerance technique to cope with permanent node failures. It is supported by X10, one of the major Partitioned Global Address Space (PGAS) languages. In Resilient X10, an exception is thrown when a place (node) fails. This paper investigates task pools, which are often used by irregular applications to balance their load. We consider global load balancing with one worker per place. Each worker maintains a private task pool and supports cooperative work stealing. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. Our first contribution is a task pool algorithm that can handle permanent place failures. It is based on snapshots that are regularly written to other workers and are updated in the event of stealing. Second, we implemented the algorithm in the Global Load Balancing framework GLB, which is part of the standard library of X10. We ran experiments with the Unbalanced Tree Search (UTS) and Between ness Centrality (BC) benchmarks. With 64 places on 4 nodes, for instance, we observed an overhead of about 4% for using fault-tolerant GLB instead of GLB. The protocol overhead for a place failure was neglectable.