Accelerating MPI Collective Communications through Hierarchical Algorithms Without Sacrificing Inter-Node Communication Flexibility

2014 IEEE 28th International Parallel and Distributed Processing Symposium | Pub Date: 2014-05-19 | DOI: 10.1109/IPDPS.2014.32

Benjamin S. Parsons, Vijay S. Pai

Citations: 6

Abstract

Accelerating MPI Collective Communications through Hierarchical Algorithms Without Sacrificing Inter-Node Communication Flexibility

This paper presents and evaluates a universal algorithm to improve the performance of MPI collective communication operations on hierarchical clusters with many-core nodes. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication (including collectives like Alltoallv). This algorithm improves on past work that converts a specific collective algorithm into a hierarchical version and is generally restricted to fan-in, fan-out, and Allgather algorithms. Experimental results show impressive performance improvements utilizing a variety of collectives from MPICH as well as the closed-source Cray MPT for the inter-node communication. The experimental evaluation tests the new algorithms with as many as 65536 cores and sees speedups over the baseline averaging 14.2x for Alltoallv, 26x for Allgather, and 32.7x for Reduce-Scatter. The paper further improves inter-node communication by utilizing multiple senders from the same shared memory buffer, achieving additional speedups averaging 2.5x. The discussion also evaluates special-purpose extensions to improve intra-node communication by returning shared memory or copy-on-write protected buffers from the collective.
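The three-phase structure described in the abstract can be illustrated with a small pure-Python simulation of an Allgather. This is only a sketch under stated assumptions, not the authors' implementation: it uses no real MPI, the function names `flat_allgather` and `hierarchical_allgather` are hypothetical, and ranks are assumed to be packed onto nodes in contiguous blocks of `cores_per_node`. The point it shows is that the inter-node step can be any unmodified, hierarchy-unaware collective, run with one participant per node instead of one per core.

```python
from collections import defaultdict


def flat_allgather(contribs):
    """Stand-in for an unmodified, hierarchy-unaware Allgather.

    contribs[i] is the list contributed by participant i. Every participant
    receives its own copy of the concatenation in participant order.
    """
    gathered = [item for contrib in contribs for item in contrib]
    return [list(gathered) for _ in contribs]


def hierarchical_allgather(values, cores_per_node):
    """Simulated hierarchical Allgather (illustrative sketch only).

    values[r] is rank r's contribution. ASSUMPTION: rank r lives on node
    r // cores_per_node (block placement).
    """
    # Phase 1 (intra-node): every rank writes into its node's shared buffer.
    shared = defaultdict(list)
    for rank, value in enumerate(values):
        shared[rank // cores_per_node].append(value)
    nodes = sorted(shared)

    # Phase 2 (inter-node): one participant per node runs the flat collective
    # on the whole node buffer, so the collective sees len(nodes) participants
    # rather than len(values).
    node_results = flat_allgather([shared[node] for node in nodes])

    # Phase 3 (intra-node): ranks on a node read the same result buffer.
    # Handing every local rank the same list object loosely mirrors returning
    # a shared-memory buffer from the collective.
    return [node_results[rank // cores_per_node] for rank in range(len(values))]


if __name__ == "__main__":
    result = hierarchical_allgather(list(range(8)), cores_per_node=4)
    print(result[0])  # the gathered buffer seen by rank 0
```

In this toy model the flat collective runs across 2 participants for 8 ranks on 4-core nodes, which is the source of the reduced inter-node traffic; the speedups reported in the abstract come from the paper's measurements, not from this sketch.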
