Hybrid Approach to Optimize MPI Collectives by In-network-computation and Point-to-Point Messages

2022 7th International Conference on Computer and Communication Systems (ICCCS). Pub Date: 2022-04-22. DOI: 10.1109/icccs55155.2022.9846190

Shuping Chen, Wangquan He, Fengbin Qi, Yan Zheng, K. Yu



Abstract

Using the in-network-computation capabilities of network devices (also called hardware collectives) to optimize MPI collectives has become popular in high-performance computing and shows significant performance advantages. However, hardware collectives are not flawless in practical use. One problem is that they are difficult to use: to obtain their performance advantage, the network management software needs to create a dedicated aggregate tree for each MPI communicator, which is a complicated task. One solution is to have MPI communicators share the global, imprecise aggregate trees that the management software creates when the network is initialized, but this leads to heavy interference between MPI communicators and causes significant performance degradation. A tradeoff therefore has to be made between performance and ease of use. We propose a hybrid approach that optimizes MPI collectives by combining in-network computation with point-to-point messages. On the one hand, we use the pre-created aggregate trees in each super-node rather than sending requests to the network management software to create dedicated aggregate trees. On the other hand, hardware collective traffic is confined to the local super-node, so it cannot disturb jobs running on other super-nodes. We provide a cost model to evaluate the overhead of the hybrid collective algorithms, and we test their performance on the new-generation Sunway supercomputer. The results show that our approach reduces median latency by 18% to 74% compared with collectives implemented with point-to-point messages, although performance decreases slightly compared with the original hardware collectives. In addition, the tail latency of our approach is significantly lower than that of the original hardware collectives in the presence of heavy interference.
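The two-level structure described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: ranks are assumed to be grouped into super-nodes by contiguous blocks, a plain Python sum stands in for the hardware collective on the pre-created aggregate tree, and the super-node leaders are assumed to combine their partial results with a binomial-tree reduce followed by a binomial-tree broadcast over point-to-point messages. The names `hybrid_allreduce` and `p2p_allreduce` are hypothetical.

```python
from typing import List, Tuple


def p2p_allreduce(partials: List[int]) -> Tuple[List[int], int]:
    """Sum-allreduce among super-node leaders, counting point-to-point messages.

    Assumption: binomial-tree reduce to leader 0, then binomial-tree broadcast.
    """
    k = len(partials)
    buf = list(partials)
    messages = 0
    d = 1
    # Reduce phase: leader i receives from leader i + d and accumulates.
    while d < k:
        for i in range(0, k, 2 * d):
            if i + d < k:
                buf[i] += buf[i + d]
                messages += 1
        d *= 2
    # Broadcast phase: replay the same tree in reverse order.
    d //= 2
    while d >= 1:
        for i in range(0, k, 2 * d):
            if i + d < k:
                buf[i + d] = buf[i]
                messages += 1
        d //= 2
    return buf, messages


def hybrid_allreduce(values: List[int], supernode_size: int) -> Tuple[List[int], int]:
    """Simulate a hybrid sum-allreduce over len(values) ranks.

    Returns the per-rank results and the number of inter-super-node messages.
    """
    if supernode_size <= 0:
        raise ValueError("supernode_size must be positive")
    n = len(values)
    # Assumption: super-nodes are contiguous blocks of ranks.
    groups = [list(range(s, min(s + supernode_size, n)))
              for s in range(0, n, supernode_size)]
    # Phase 1: reduction inside each super-node (stand-in for the hardware
    # collective on the pre-created aggregate tree).
    partials = [sum(values[r] for r in g) for g in groups]
    # Phase 2: leaders combine partial sums with point-to-point messages.
    totals, messages = p2p_allreduce(partials)
    # Phase 3: broadcast inside each super-node (again a hardware stand-in).
    result = [0] * n
    for g, total in zip(groups, totals):
        for r in g:
            result[r] = total
    return result, messages


if __name__ == "__main__":
    out, msgs = hybrid_allreduce([1, 2, 3, 4, 5, 6, 7], supernode_size=3)
    print(out, msgs)
```

In this sketch only phase 2 crosses super-node boundaries, which mirrors the abstract's point that hardware collective traffic stays local. With k super-nodes, the assumed binomial scheme uses 2(k - 1) point-to-point messages.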
