Authors: Gui Liang, Siti Norbaya Daud, N. Ismail
Venue: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things
Published: 2023-05-26
DOI: 10.1145/3603781.3603876 (https://doi.org/10.1145/3603781.3603876)
GPU Cluster RDMA communication technology and congestion control
Abstract. This paper discusses Remote Direct Memory Access (RDMA) communication technology and congestion control methods for Graphics Processing Unit (GPU) clusters. It studies the RDMA network implementations widely used in GPU clusters. Three implementations, InfiniBand, iWARP, and RoCE, are analysed and compared in terms of performance and applicable environments. Based on this analysis, a new congestion control algorithm, the DBCC & CBFC algorithm, is proposed. The algorithm, which is based on delay feedback control and credit flow control, is designed to prevent network congestion and increased latency in GPU cluster RDMA networks. The working principles of the algorithm are introduced, including how the adjustment to the sending rate is calculated, how the sender and receiver are initialized, and the mechanisms for handling packet loss and timeouts. Experimental results show that the algorithm optimizes the network performance of RDMA communication in GPU clusters while avoiding congestion and minimizing packet loss. However, owing to limited experimental conditions, the algorithm could not be tested in a wider range of environments. In practical applications, its applicability needs to be carefully evaluated and adjusted to the specific situation.
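The abstract names the two ingredients of the algorithm (delay feedback for the sending rate, credits for flow control) but does not give its update rule. The following Python sketch is therefore only a generic illustration of how such a sender might combine the two. It is not the paper's DBCC & CBFC algorithm: the class `Sender`, its fields, the proportional gain, and the rate-halving reaction to a timeout are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """Hypothetical sender state; none of these names come from the paper."""
    rate_gbps: float            # current sending rate
    credits: int                # packets the receiver can currently buffer
    target_rtt_us: float        # delay the controller tries to hold
    gain: float = 0.5           # assumed proportional gain
    min_rate_gbps: float = 0.1  # assumed lower bound on the rate
    max_rate_gbps: float = 100.0  # assumed upper bound on the rate

    def on_rtt_sample(self, rtt_us: float) -> float:
        """Delay feedback: adjust the rate in proportion to the delay error."""
        error = (self.target_rtt_us - rtt_us) / self.target_rtt_us
        adjustment = self.gain * error * self.rate_gbps
        new_rate = self.rate_gbps + adjustment
        # Clamp so the rate stays inside the assumed bounds.
        self.rate_gbps = min(self.max_rate_gbps, max(self.min_rate_gbps, new_rate))
        return adjustment

    def try_send(self) -> bool:
        """Credit flow control: send one packet only if a credit is available."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_credit_grant(self, granted: int) -> None:
        """The receiver returns credits as it frees buffer space."""
        self.credits += granted

    def on_timeout(self) -> None:
        """Assumed reaction to loss or timeout: halve the rate."""
        self.rate_gbps = max(self.min_rate_gbps, self.rate_gbps / 2)
```

In this sketch, a measured delay above the target lowers the rate and a delay below it raises the rate, while the credit counter independently stops the sender from overrunning the receiver's buffer. A real implementation would take the target delay, gain, and initial credits from the paper's initialization procedure, which the abstract does not specify.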