Designing Hierarchical Multi-HCA Aware Allgather in MPI

Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda

Workshop Proceedings of the 51st International Conference on Parallel Processing, August 29, 2022. DOI: 10.1145/3547276.3548524
To accelerate communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a "multi-rail" network. The second- and third-ranked systems on the Top500 list use two adapters per node, and the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. Fully utilizing this abundance of networking resources is a non-trivial task. The Message Passing Interface (MPI) is a dominant programming model for high-performance computing clusters, yet not all MPI collectives utilize all of these resources, and the shortfall becomes more apparent as bandwidth and adapter counts grow within a given cluster. In this work, we take up this task and propose hierarchical, multi-HCA-aware Allgather designs; Allgather is a communication-intensive collective used widely in applications such as matrix multiplication and as a building block of other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our new schemes improve performance for both single-node and multi-node communication. We see inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, at 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X, respectively. At the application level, the enhanced Allgather yields 1.98x and 1.42x speedups in a matrix-vector multiplication kernel compared to HPC-X and MVAPICH2-X, respectively, and the Allreduce performs up to 7.83% better than MVAPICH2-X in deep learning training.
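For context, the sketch below shows the standard way MPI_Allgather is used in a row-distributed matrix-vector multiplication, the kind of kernel evaluated at the application level above. It is a minimal illustration of the collective's role, not the paper's hierarchical multi-HCA design; the block size, matrix contents, and variable names are illustrative assumptions.

```c
/*
 * Minimal sketch: row-distributed matrix-vector multiplication.
 * Each rank owns a contiguous block of rows of an n x n matrix,
 * computes its partial result, and MPI_Allgather assembles the
 * full output vector y on every rank. Sizes and data are illustrative.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int rows_per_rank = 4;          /* illustrative block size   */
    const int n = rows_per_rank * size;   /* global problem dimension  */

    double *local_A = malloc((size_t)rows_per_rank * n * sizeof(double));
    double *x       = malloc((size_t)n * sizeof(double));
    double *local_y = malloc((size_t)rows_per_rank * sizeof(double));
    double *y       = malloc((size_t)n * sizeof(double));

    /* Fill the local matrix block and the (replicated) input vector. */
    for (int i = 0; i < rows_per_rank * n; i++) local_A[i] = 1.0;
    for (int i = 0; i < n; i++)                 x[i] = 1.0;

    /* Local partial product: y_block = A_block * x. */
    for (int i = 0; i < rows_per_rank; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += local_A[i * n + j] * x[j];
        local_y[i] = sum;
    }

    /* Allgather concatenates every rank's block so all ranks hold y. */
    MPI_Allgather(local_y, rows_per_rank, MPI_DOUBLE,
                  y,       rows_per_rank, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("y[0] = %.1f (expected %d)\n", y[0], n);

    free(local_A); free(x); free(local_y); free(y);
    MPI_Finalize();
    return 0;
}
```

Because every rank needs the complete result vector, the Allgather step dominates communication time as the process count grows, which is why multi-rail-aware implementations of this collective translate directly into application-level speedups.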