Efficient collective operations using remote memory operations on VIA-based clusters

Proceedings International Parallel and Distributed Processing Symposium
Pub Date: 2003-04-22
DOI: 10.1109/IPDPS.2003.1213135

Rinku Gupta, P. Balaji, D. Panda, J. Nieplocha


Citations: 29

Abstract

High-performance scientific applications require fast and efficient collective communication operations. Most collective communication operations have been built on top of point-to-point send/receive primitives. Modern user-level protocols such as VIA and the emerging InfiniBand architecture support remote DMA (RDMA) operations. These operations not only allow data to be moved between nodes with low overhead but also allow the user to create and provide a logical shared memory address space across the nodes. This feature shows potential for designing high-performance and scalable collective operations. In this paper, we discuss the design issues that may form the basis of an RDMA-supported collective communication library. As a proof of concept, we have designed and implemented RDMA-based broadcast and RDMA-based allreduce operations. For RDMA-based broadcast, we obtain a benefit of 14% compared to send/receive-based broadcast for a 4KB data size on a 16-node cluster. We also introduce a new reduce algorithm called the Degree-k tree-based reduce algorithm. Combining the RDMA mechanism with the new reduce algorithm shows a benefit of 38% for 4-byte messages and 9% for 4KB messages on a 16-node cluster for the allreduce operation. We also introduce analytical models for broadcast and allreduce to predict the performance of this design on large-scale clusters. These analytical models predict a performance benefit of about 35-40% for 4-byte messages and around 14% for 4KB messages on 512- and 1024-node clusters for the allreduce operation.
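The abstract names a Degree-k tree-based reduce but does not spell out its steps, so the following is only a rough illustration of the general idea of a k-ary tree reduction followed by a broadcast (an allreduce). It is a single-process simulation, not the authors' algorithm and not RDMA or VIA code. The function name `tree_allreduce`, the grouping rule (each parent combines a chunk of up to k ranks, itself included), and the level counter are all assumptions made for this sketch.

```python
from functools import reduce
import operator


def tree_allreduce(values, k, op=operator.add):
    """Simulate an allreduce over a k-ary reduction tree.

    Hypothetical helper for illustration only: ranks are list indices,
    and "communication" is plain list access inside one process.
    Returns (per-rank results, number of reduce levels).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    partial = list(values)            # partial[i] = value currently held by rank i
    if not partial:
        raise ValueError("values must be non-empty")

    active = list(range(len(partial)))  # ranks still holding a partial result
    levels = 0
    while len(active) > 1:
        next_active = []
        # Assumed grouping: split the active ranks into chunks of up to k;
        # the first rank in each chunk acts as the parent and combines
        # its own value with those of the other ranks in the chunk.
        for start in range(0, len(active), k):
            group = active[start:start + k]
            parent = group[0]
            partial[parent] = reduce(op, (partial[r] for r in group))
            next_active.append(parent)
        active = next_active
        levels += 1

    root_value = partial[active[0]]
    # Broadcast phase: every rank ends up with the root's reduced value.
    return [root_value] * len(partial), levels


if __name__ == "__main__":
    results, levels = tree_allreduce(range(16), k=4)
    print(results[0], levels)
```

Under these assumptions, a larger k gives a shallower tree (fewer reduce levels) at the cost of more values combined per parent at each level; how that trade-off plays out over RDMA on a real cluster is what the paper's measurements and analytical models address.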

