Ling Zhang;Xuefei Yang;Zhenlong Wan;Hang Liu;Wei Gu;Pingjing Liu;Qilin Dai;Shanwei Ye;Yingcheng Lin
DOI: 10.1109/TCAD.2024.3514782
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 6, pp. 2156-2167
Publication date: 2024-12-11
Publication type: Journal Article
Impact factor: 2.9 (JCR Q2, Computer Science, Hardware & Architecture)
Open access: no
URL: https://ieeexplore.ieee.org/document/10787243/
Citations: 0
Abstract
A High-Performance RDMA NIC With Ultrahighly Scalable Connections
Remote direct memory access (RDMA) technology has significantly increased network bandwidth and reduced transmission latency through kernel bypass and protocol offloading, overcoming obstacles in distributed computing systems. However, as more intricate services are deployed on RDMA networks, current RDMA network interface cards (RNICs) suffer a notable performance decline as the number of queue pair (QP) connections increases, which substantially constrains the broad adoption of RDMA networks. To address this challenge, this article proposes a novel RNIC architecture with high connection scalability. The architecture incorporates a multitiered cache structure to handle diverse communication contexts, enabling the RNIC to support ultrahigh QP connection counts while minimizing on-chip memory usage. In addition, the architecture supports chain prefetching, which allows the on-chip caches to manage multiple concurrent requests and thus avoids the latency caused by cache misses and access conflicts when many QPs communicate concurrently. This preserves transmission performance in multi-QP connection scenarios. The article implements a 100G RNIC based on this architecture on Xilinx’s U280 FPGA and validates its performance. With approximately 1 M of on-chip memory used for context, it can support 64 K performant QP connections ($25\times$ that of CX-6) and can be extended if necessary. Experimental results confirm the high connection scalability of the RNIC, which achieves approximately 92 Gb/s network throughput for data packet transmission with 1–64 K QPs executing concurrently.
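The abstract does not describe the cache tiers, the replacement policy, or how chain prefetching is implemented, so the following is only a toy software model of the general idea, not the authors' design. It assumes a small LRU cache of QP contexts in front of a large backing table that stands in for host memory, and it approximates prefetching as looking a few entries ahead in an already known request schedule. The names `QPContextCache`, `run`, `capacity`, and `lookahead` are all hypothetical.

```python
from collections import OrderedDict


class QPContextCache:
    """Toy model: a small LRU cache of QP contexts in front of a big table."""

    def __init__(self, capacity, backing):
        self.capacity = capacity    # number of contexts that fit "on chip"
        self.backing = backing      # dict qpn -> context; stands in for host memory
        self.lines = OrderedDict()  # qpn -> context, least recently used first
        self.hits = 0
        self.misses = 0

    def _fill(self, qpn):
        # Bring a context into the cache, evicting the LRU entry if full.
        if qpn in self.lines:
            self.lines.move_to_end(qpn)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[qpn] = self.backing[qpn]

    def prefetch(self, qpns):
        # Load contexts for upcoming requests ahead of their use (assumed scheme).
        for qpn in qpns:
            self._fill(qpn)

    def lookup(self, qpn):
        # Demand access: count a hit or a miss, then return the context.
        if qpn in self.lines:
            self.hits += 1
            self.lines.move_to_end(qpn)
        else:
            self.misses += 1
            self._fill(qpn)
        return self.lines[qpn]


def run(schedule, capacity, lookahead):
    """Replay a list of QP numbers and return (hits, misses)."""
    backing = {qpn: {"qpn": qpn, "psn": 0} for qpn in set(schedule)}
    cache = QPContextCache(capacity, backing)
    for i, qpn in enumerate(schedule):
        cache.lookup(qpn)
        cache.prefetch(schedule[i + 1 : i + 1 + lookahead])
    return cache.hits, cache.misses


if __name__ == "__main__":
    round_robin = list(range(1024)) * 2  # 1024 QPs, each served twice
    print("no prefetch:  ", run(round_robin, capacity=8, lookahead=0))
    print("with prefetch:", run(round_robin, capacity=8, lookahead=4))
```

In this model, a round-robin schedule over many more QPs than the cache can hold misses on every access without lookahead, while a short lookahead turns all but the first access into hits. The model ignores fetch latency and access conflicts, which are the effects the article targets in hardware.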
About the journal:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.