Diagnosing End-Host Network Bottlenecks in RDMA Servers
Authors: Kefei Liu; Jiao Zhang; Zhuo Jiang; Haoran Wei; Xiaolong Zhong; Lizhuang Tan; Tian Pan; Tao Huang
Journal: IEEE/ACM Transactions on Networking, vol. 32, no. 5, pp. 4302-4316
Published: 2024-07-16
DOI: 10.1109/TNET.2024.3416419
URL: https://ieeexplore.ieee.org/document/10599388/
Journal metrics: impact factor 3.0; JCR Q2 (Computer Science, Hardware & Architecture)
In RDMA (Remote Direct Memory Access) networks, end-host networks, which comprise the intra-host network and the RNICs (RDMA NICs), have long been considered robust and have received little attention. However, as RNIC line rates rapidly increase to multiple hundreds of gigabits per second, the intra-host network becomes a potential performance bottleneck for network applications. Intra-host network bottlenecks can degrade intra-host bandwidth and increase intra-host latency, while RNIC problems can cause connection failures and packet drops. Such host network problems can severely degrade network performance, yet they often go unnoticed when they occur because of the lack of a monitoring system. Furthermore, existing diagnostic mechanisms cannot diagnose host network problems efficiently. In this paper, we analyze the symptoms of host network problems based on our long-term troubleshooting experience and propose Hostping, the first monitoring and diagnostic system dedicated to host networks. The core idea of Hostping is to conduct 1) loopback tests between RNICs and endpoints within the host to measure intra-host latency and bandwidth, and 2) mutual probing between RNICs on a host to measure RNIC connectivity. We have deployed Hostping on thousands of servers in our distributed machine learning system. Hostping not only detects and diagnoses known host network problems within minutes, but also revealed eight problems we had not noticed before.
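The mutual-probing idea in the abstract can be illustrated with a small sketch. This is not Hostping's implementation, which the abstract does not describe in detail. It assumes a hypothetical `probe(src, dst)` callable that returns whether one RNIC on a host can reach another, and it uses a deliberately simple rule (an RNIC is suspect when every probe it takes part in fails). The device names are placeholders.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

# Hypothetical probe signature: returns True if src can reach dst.
Probe = Callable[[str, str], bool]


def probe_all_pairs(rnics: List[str], probe: Probe) -> Dict[Tuple[str, str], bool]:
    """Probe every ordered (src, dst) pair of distinct RNICs on one host."""
    return {(src, dst): probe(src, dst) for src, dst in permutations(rnics, 2)}


def suspect_rnics(rnics: List[str], results: Dict[Tuple[str, str], bool]) -> List[str]:
    """Flag an RNIC when every probe it takes part in failed.

    Assumed heuristic for illustration only. With just two RNICs a single
    fault makes both look suspect, so the rule needs at least three RNICs
    to isolate one faulty device.
    """
    suspects = []
    for nic in rnics:
        involved = [ok for (src, dst), ok in results.items() if nic in (src, dst)]
        if involved and not any(involved):
            suspects.append(nic)
    return suspects


if __name__ == "__main__":
    nics = ["rnic0", "rnic1", "rnic2", "rnic3"]  # placeholder device names
    broken = {"rnic2"}

    # Simulated probe: fails whenever a broken RNIC is on either end.
    def fake_probe(src: str, dst: str) -> bool:
        return src not in broken and dst not in broken

    results = probe_all_pairs(nics, fake_probe)
    print(suspect_rnics(nics, results))  # prints ['rnic2']
```

In a real deployment the simulated probe would be replaced by an actual RDMA connectivity test between the two devices. The pairwise result matrix is the part that lets a faulty RNIC be separated from a faulty peer.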
About the journal:
The IEEE/ACM Transactions on Networking’s high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.