Allreduce algorithm optimization of OpenMPI communication library

International Conference on Algorithms, Microchips and Network Applications. Pub Date: 2024-06-08. DOI: 10.1117/12.3031959

Guangyao Zhang, Wei Wan, Junhong Li





Abstract

MPI (Message Passing Interface) plays a crucial role in parallel computing. The Allreduce implementation in the OpenMPI communication library handles communication poorly when the number of processes is not a power of two. The two existing algorithms deal with this case by excluding some processes so that the remaining process count is a power of two. However, the exclusion rule is too simplistic: it leaves the participating processes unevenly distributed across nodes, which greatly reduces communication efficiency. To address this problem, the layout of processes on nodes is taken into account and the range of excluded processes is redefined. Both algorithms receive generic load-balancing optimizations and adaptations for domestic architectures, resulting in improved load balancing. Experimental results show that, at a communication scale of 16 nodes, the recursive_doubling algorithm improves performance by up to 30%, while the reduce_scatter_allgather algorithm improves performance by up to 21%.
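The imbalance described in the abstract can be illustrated with a small sketch. It is not the authors' method. It assumes a block layout (consecutive ranks fill one node before the next) and a commonly used baseline in which, with r processes above the nearest power of two, the even ranks among the first 2r hand their data to a neighbour and sit out. The function names, the round-robin alternative `spread_excluded`, and the choice of 6 processes per node are all hypothetical and serve only to show how the choice of excluded ranks changes the per-node participant count.

```python
from collections import Counter


def largest_power_of_two(p: int) -> int:
    """Largest power of two that does not exceed p (p >= 1)."""
    return 1 << (p.bit_length() - 1)


def baseline_excluded(p: int) -> list[int]:
    """Assumed baseline: even ranks among the first 2*r ranks sit out."""
    r = p - largest_power_of_two(p)
    return [rank for rank in range(2 * r) if rank % 2 == 0]


def spread_excluded(p: int, ppn: int) -> list[int]:
    """Hypothetical layout-aware rule: pick the r excluded ranks
    round-robin across nodes, one local slot at a time."""
    r = p - largest_power_of_two(p)
    nodes = (p + ppn - 1) // ppn
    excluded: list[int] = []
    slot = 0
    while len(excluded) < r:
        for node in range(nodes):
            rank = node * ppn + slot
            if rank < p and len(excluded) < r:
                excluded.append(rank)
        slot += 1
    return sorted(excluded)


def participants_per_node(p: int, ppn: int, excluded: list[int]) -> list[int]:
    """Number of ranks per node that remain in the power-of-two phase."""
    out = set(excluded)
    counts = Counter(rank // ppn for rank in range(p) if rank not in out)
    nodes = (p + ppn - 1) // ppn
    return [counts.get(node, 0) for node in range(nodes)]


if __name__ == "__main__":
    p, ppn = 96, 6  # 16 nodes x 6 processes per node (illustrative values)
    print(participants_per_node(p, ppn, baseline_excluded(p)))
    print(participants_per_node(p, ppn, spread_excluded(p, ppn)))
```

With these illustrative values, the assumed baseline leaves 3 participants on each of the first ten nodes, 4 on the eleventh, and 6 on each of the last five, whereas the round-robin rule leaves 4 on every node. A real implementation would also have to keep each excluded rank's data partner on the same node, which this sketch does not model.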

