Ling Zhang;Xuefei Yang;Zhenlong Wan;Hang Liu;Wei Gu;Pingjing Liu;Qilin Dai;Shanwei Ye;Yingcheng Lin
Title: A High-Performance RDMA NIC With Ultrahighly Scalable Connections
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 6, pp. 2156-2167
Publication date: 2024-12-11
DOI: 10.1109/TCAD.2024.3514782
URL: https://ieeexplore.ieee.org/document/10787243/
Publication type: Journal Article
Impact factor: 2.9 (JCR Q2, Computer Science, Hardware & Architecture)
Citation count: 0
Abstract
Remote direct memory access (RDMA) technology has significantly enhanced network bandwidth and decreased transmission latency through kernel bypass and protocol offloading, overcoming obstacles in distributed computing systems. However, as more intricate services are deployed in RDMA networks, the performance of current RDMA network interface cards (RNICs) declines notably as the number of queue pair (QP) connections increases, substantially constraining the broad adoption of RDMA networks. To address this challenge, this article proposes a novel RNIC architecture with high connection scalability. The architecture incorporates a multitiered cache structure to handle diverse communication contexts, enabling the RNIC to support ultrahigh QP connection counts while minimizing on-chip memory usage. In addition, the architecture supports chain prefetching, which allows the on-chip caches to manage multiple concurrent requests, thus averting the latency caused by cache misses and access conflicts when many QPs communicate concurrently. This preserves transmission performance in multi-QP connection scenarios. This article implements a 100G RNIC based on this architecture on Xilinx's U280 FPGA and validates its performance. With approximately 1 M of on-chip memory for context, it can support 64 K performant QP connections ($25\times$ that of CX-6) and can be extended if necessary. Experimental results confirm the high connection scalability of the RNIC, which achieves approximately 92 Gb/s network throughput for data packet transmission with 1–64 K QPs executing concurrently.
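The effect the abstract attributes to prefetching can be illustrated with a toy model. The sketch below is not the paper's design: it replaces the multitiered cache and chain prefetching with a single small LRU cache of QP contexts plus a simple lookahead prefetch over the pending request queue. All names (`ContextCache`, `run`, `lookahead`) and all sizes are assumptions chosen for illustration.

```python
from collections import OrderedDict


class ContextCache:
    """Toy LRU cache of QP contexts backed by a larger, slower store.

    Hypothetical model for illustration only; it is not the RNIC's cache.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # qp_id -> context, least recent first
        self.demand_misses = 0        # misses that would stall packet processing

    def _install(self, qp_id):
        # Model a fetch from the backing store, then evict if over capacity.
        self.entries[qp_id] = {"qp": qp_id}
        self.entries.move_to_end(qp_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used

    def prefetch(self, qp_id):
        # Bring a context on chip ahead of use; not counted as a stall.
        if qp_id not in self.entries:
            self._install(qp_id)

    def access(self, qp_id):
        # Demand access by the packet-processing path.
        if qp_id in self.entries:
            self.entries.move_to_end(qp_id)
        else:
            self.demand_misses += 1
            self._install(qp_id)
        return self.entries[qp_id]


def run(schedule, capacity, lookahead):
    """Replay a list of QP ids and return the number of demand misses."""
    cache = ContextCache(capacity)
    for i, qp_id in enumerate(schedule):
        # Prefetch contexts for the next `lookahead` queued requests.
        for upcoming in schedule[i + 1 : i + 1 + lookahead]:
            cache.prefetch(upcoming)
        cache.access(qp_id)
    return cache.demand_misses


if __name__ == "__main__":
    # Round-robin over far more QPs than the cache can hold (assumed sizes).
    schedule = list(range(1024)) * 2
    print("demand misses without prefetch:", run(schedule, 64, 0))
    print("demand misses with lookahead 4:", run(schedule, 64, 4))
```

In this model, a round-robin workload over more QPs than cache slots misses on every request without prefetching, while a short lookahead hides all but the first miss. Real hardware must also bound prefetch bandwidth and resolve access conflicts, which this sketch ignores.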
About the journal:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.