Optimizing Non-commutative Allreduce Over Virtualized, Migratable MPI Ranks

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Pub Date: 2022-05-01. DOI: 10.1109/IPDPSW55747.2022.00085

Sam White, L. Kalé



Abstract

Dynamic load balancing can be difficult for MPI-based applications. Application logic and algorithms are often rewritten to enable dynamic repartitioning of the domain. An alternative approach is to virtualize the MPI ranks as threads, instead of operating system processes, and to migrate those threads around the system to balance the computational load. Adaptive MPI is one such implementation: it supports virtualization of MPI ranks as migratable user-level threads. However, this migratability can itself introduce new performance overheads to applications. In this paper, we identify non-commutative reduction operations as problematic for any runtime supporting either user-defined initial mapping of ranks or dynamic migration of ranks among the cores or nodes of a machine. We investigate the challenges associated with supporting efficient non-commutative reduction operations, and explore algorithmic alternatives such as recursive doubling and halving in combination with a novel adaptive message combining technique. We explore tradeoffs among the different algorithms for various message sizes and mappings of ranks to cores, and demonstrate our performance improvements using microbenchmarks.
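To illustrate why rank placement matters for non-commutative reductions, the following is a minimal single-process sketch in Python. It is not Adaptive MPI's implementation and does not include the adaptive message combining technique mentioned in the abstract. The function names and the `placement` list are hypothetical, string concatenation stands in for an associative but non-commutative user-defined operation, and the number of ranks is assumed to be a power of two. The sketch contrasts a reduction that combines contributions core by core, which is only valid for commutative operations, with a textbook recursive doubling schedule that always places the lower-ranked block on the left.

```python
from functools import reduce


def recursive_doubling_allreduce(values, op):
    """Simulate recursive doubling over len(values) ranks.

    values[r] is rank r's contribution. op must be associative but
    need not be commutative. Assumption of this sketch: the number
    of ranks is a power of two.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    current = list(values)
    distance = 1
    while distance < p:
        nxt = []
        for rank in range(p):
            partner = rank ^ distance
            # Each rank holds the result for an aligned, contiguous block
            # of ranks. The partner holds the adjacent block, so the block
            # with the lower ranks must be the left operand.
            if partner < rank:
                nxt.append(op(current[partner], current[rank]))
            else:
                nxt.append(op(current[rank], current[partner]))
        current = nxt
        distance *= 2
    return current  # every rank ends with the same rank-ordered result


def locality_order_reduce(values, op, placement):
    """Combine contributions core by core, ignoring rank order.

    placement[r] is the (hypothetical) core hosting rank r. This order
    is only correct when op is commutative.
    """
    order = sorted(range(len(values)), key=lambda r: (placement[r], r))
    return reduce(op, (values[r] for r in order))


def concat(a, b):
    # Associative but non-commutative stand-in for a user-defined op.
    return a + b


if __name__ == "__main__":
    contributions = list("abcdefgh")        # rank r contributes one letter
    round_robin = [r % 2 for r in range(8)]  # ranks alternate between 2 cores
    print(locality_order_reduce(contributions, concat, round_robin))  # acegbdfh
    print(recursive_doubling_allreduce(contributions, concat)[0])     # abcdefgh
```

With ranks placed round-robin on two cores, the core-by-core reduction yields `acegbdfh` rather than the rank-ordered `abcdefgh`, whereas the recursive doubling schedule gives the rank-ordered result on every rank regardless of placement. The cost is that its communication partners are fixed by rank number, not by where the ranks currently reside.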
