R. Graham, Steve Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer
{"title":"ConnectX-2 InfiniBand管理队列:网络卸载集体操作新支持初探","authors":"R. Graham, Steve Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer","doi":"10.1109/CCGRID.2010.9","DOIUrl":null,"url":null,"abstract":"This paper introduces the newly developed Infini- Band (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependancies, and progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. This provides a means for overlapping collective communications managed by the HCA and computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, describing the high level details of how this new capability is used to implement the MPI Barrier collective operation, focusing on the latency sensitive performance aspects of this new capability. This paper concludes with small scale bench- mark experiments comparing implementations of the barrier collective operation, using the new network offload capabilities, with established point-to-point based implementations of these same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability provides to improve the scalability of high- performance applications using collective communications. The latency of the HCA based implementation of the barrier is similar to that of the best performing point-to-point based implementation managed by the central processing unit, starting to outperform these as the number of processes involved in the collective operation increases.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"216 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":"{\"title\":\"ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations\",\"authors\":\"R. Graham, Steve Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer\",\"doi\":\"10.1109/CCGRID.2010.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces the newly developed Infini- Band (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependancies, and progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. 
This provides a means for overlapping collective communications managed by the HCA and computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, describing the high level details of how this new capability is used to implement the MPI Barrier collective operation, focusing on the latency sensitive performance aspects of this new capability. This paper concludes with small scale bench- mark experiments comparing implementations of the barrier collective operation, using the new network offload capabilities, with established point-to-point based implementations of these same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability provides to improve the scalability of high- performance applications using collective communications. The latency of the HCA based implementation of the barrier is similar to that of the best performing point-to-point based implementation managed by the central processing unit, starting to outperform these as the number of processes involved in the collective operation increases.\",\"PeriodicalId\":444485,\"journal\":{\"name\":\"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing\",\"volume\":\"216 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-05-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"43\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGRID.2010.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2010.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 43
Abstract
This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data-flow dependencies and to progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. It allows collective communications managed by the HCA to be overlapped with computation on the Central Processing Unit (CPU), making it possible to reduce the impact of system noise on parallel applications that use collective operations. The paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, presenting the high-level details of an HCA-based implementation of the MPI Barrier collective operation and focusing on its latency-sensitive performance aspects. The paper concludes with small-scale benchmark experiments comparing barrier implementations that use the new network offload capabilities with established point-to-point implementations of the same algorithms, which manage the data flow on the central processing unit. These early results demonstrate the promise this new capability holds for improving the scalability of high-performance applications that use collective communications. The latency of the HCA-based barrier implementation is similar to that of the best-performing point-to-point implementation managed by the central processing unit, and it begins to outperform them as the number of processes involved in the collective operation increases.
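
To make the offload model more concrete, the sketch below expresses a recursive-doubling barrier as a dependency-ordered list of send and wait tasks, the kind of description the abstract says the HCA consumes. It is a minimal illustration under stated assumptions: the mq_task_t type and build_barrier_tasks function are hypothetical placeholders rather than the ConnectX-2 or IB verbs API, the process count is assumed to be a power of two, and the abstract does not state which barrier algorithm the authors actually implemented.

/* Hypothetical sketch (not the ConnectX-2 API): a recursive-doubling
 * barrier expressed as the kind of dependency-ordered task list the
 * abstract describes, where the HCA executes send and wait tasks in
 * order and each wait gates the next round. */
#include <stdio.h>

typedef enum { TASK_SEND, TASK_WAIT } task_kind_t;

typedef struct {
    task_kind_t kind;   /* send a zero-byte message, or wait for one */
    int         peer;   /* rank of the remote process for this task  */
} mq_task_t;

/* Build the per-process task list; assumes nprocs is a power of two.
 * Returns the number of tasks written into tasks[]. */
static int build_barrier_tasks(int rank, int nprocs, mq_task_t *tasks)
{
    int n = 0;
    for (int dist = 1; dist < nprocs; dist <<= 1) {
        int peer = rank ^ dist;
        tasks[n++] = (mq_task_t){ TASK_SEND, peer };  /* notify peer   */
        tasks[n++] = (mq_task_t){ TASK_WAIT, peer };  /* wait for peer */
    }
    return n;
}

int main(void)
{
    mq_task_t tasks[64];
    int n = build_barrier_tasks(/* rank */ 5, /* nprocs */ 16, tasks);
    /* A real offloaded barrier would post this list to the HCA's
     * management queue instead of printing it. */
    for (int i = 0; i < n; i++)
        printf("%s peer %d\n",
               tasks[i].kind == TASK_SEND ? "send" : "wait", tasks[i].peer);
    return 0;
}

In an actual offloaded barrier of this style, the CPU would post such a list once and return to computation while the HCA progresses the sends and waits on its own, which is the source of the communication/computation overlap the abstract highlights.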