Hardware- and software-based collective communication on the Quadrics network

F. Petrini, S. Coll, E. Frachtenberg, A. Hoisie
Proceedings IEEE International Symposium on Network Computing and Applications. NCA 2001. Published 2001-10-08. DOI: 10.1109/NCA.2001.962513 (https://doi.org/10.1109/NCA.2001.962513)
Citations: 61

Abstract

The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network: a network interface that can perform zero-copy user-level communication, and a wormhole routing switch. We also focus on the routing and flow control algorithms, on deadlock avoidance, and on how the processing nodes are integrated in a global, virtual shared memory. Experiments conducted on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 μs, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 μs. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 μs and can sustain an asymptotic bandwidth of 288 Mbytes/sec on all the nodes. The hardware-based barrier is almost insensitive to network congestion, with 93% of the synchronizations taking less than 20 μs when the network is flooded with background traffic of unicast messages. The software-based implementation, on the other hand, suffers a significant performance degradation. Under high load the hardware broadcast maintains a reasonably good latency, delivering messages of up to 2 KB in 200 μs, while the software broadcast suffers from slightly higher latencies inherited from the synchronization mechanism. Both broadcast algorithms experience a significant degradation of the sustained bandwidth with large messages.
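As a rough way to relate the two unloaded broadcast figures quoted above (13 μs for small messages and 288 Mbytes/sec asymptotic bandwidth), the sketch below plugs them into a simple linear cost model, T(n) = t0 + n/B. This model is an assumption made here for illustration, not something the abstract states: it treats 13 μs as a fixed startup cost, takes 1 Mbyte as 10^6 bytes, and ignores congestion. The function names are hypothetical.

```python
# Back-of-the-envelope linear cost model for an unloaded broadcast.
# ASSUMPTIONS (not from the paper): T(n) = t0 + n / B, with t0 taken as the
# 13 us small-message latency and B as the 288 Mbytes/sec asymptotic
# bandwidth; 1 Mbyte = 10**6 bytes, so 1 Mbyte/sec equals 1 byte/us.

STARTUP_US = 13.0          # small-message broadcast latency from the abstract
BANDWIDTH_MBYTES_S = 288.0  # asymptotic broadcast bandwidth from the abstract


def estimated_time_us(nbytes, startup_us=STARTUP_US,
                      bandwidth_mbytes_s=BANDWIDTH_MBYTES_S):
    """Estimated time in microseconds to broadcast nbytes under the model."""
    return startup_us + nbytes / bandwidth_mbytes_s


def effective_bandwidth_mbytes_s(nbytes, **kwargs):
    """Bandwidth actually achieved for one message of nbytes (Mbytes/sec)."""
    return nbytes / estimated_time_us(nbytes, **kwargs)


def half_bandwidth_size_bytes(startup_us=STARTUP_US,
                              bandwidth_mbytes_s=BANDWIDTH_MBYTES_S):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return startup_us * bandwidth_mbytes_s


if __name__ == "__main__":
    for size in (256, 2048, 65536, 1048576):
        print(size, "bytes:",
              round(estimated_time_us(size), 1), "us,",
              round(effective_bandwidth_mbytes_s(size), 1), "Mbytes/sec")
```

Under these assumptions, small messages are dominated by the startup term, and the effective bandwidth only approaches the asymptotic figure for messages much larger than `half_bandwidth_size_bytes()`. The loaded-network figures in the abstract are outside what this model covers.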