Diagnosing End-Host Network Bottlenecks in RDMA Servers

IF 3 3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Kefei Liu;Jiao Zhang;Zhuo Jiang;Haoran Wei;Xiaolong Zhong;Lizhuang Tan;Tian Pan;Tao Huang
{"title":"Diagnosing End-Host Network Bottlenecks in RDMA Servers","authors":"Kefei Liu;Jiao Zhang;Zhuo Jiang;Haoran Wei;Xiaolong Zhong;Lizhuang Tan;Tian Pan;Tao Huang","doi":"10.1109/TNET.2024.3416419","DOIUrl":null,"url":null,"abstract":"In RDMA (Remote Direct Memory Access) networks, end-host networks, including intra-host networks and RNICs (RDMA NIC), were considered robust and have received little attention. However, as the RNIC line rate rapidly increases to multi-hundred gigabits, the intra-host network becomes a potential performance bottleneck for network applications. Intra-host network bottlenecks can result in degraded intra-host bandwidth and increased intra-host latency. In addition, RNIC network problems can result in connection failures and packet drops. Host network problems can severely degrade network performance. However, when host network problems occur, they can hardly be noticed due to the lack of a monitoring system. Furthermore, existing diagnostic mechanisms cannot efficiently diagnose host network problems. In this paper, we analyze the symptom of host network problems based on our long-term troubleshooting experience and propose Hostping, the first monitoring and diagnostic system dedicated to host networks. The core idea of Hostping is to conduct 1) loopback tests between RNICs and endpoints within the host to measure intra-host latency and bandwidth, and 2) mutual probing between RNICs on a host to measure RNIC connectivity. We have deployed Hostping on thousands of servers in our distributed machine learning system. Not only can Hostping detect and diagnose host network problems we already knew in minutes, but it also reveals eight problems we did not notice before.","PeriodicalId":13443,"journal":{"name":"IEEE/ACM Transactions on Networking","volume":"32 5","pages":"4302-4316"},"PeriodicalIF":3.0000,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Networking","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10599388/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

Abstract

In RDMA (Remote Direct Memory Access) networks, end-host networks, including intra-host networks and RNICs (RDMA NIC), were considered robust and have received little attention. However, as the RNIC line rate rapidly increases to multi-hundred gigabits, the intra-host network becomes a potential performance bottleneck for network applications. Intra-host network bottlenecks can result in degraded intra-host bandwidth and increased intra-host latency. In addition, RNIC network problems can result in connection failures and packet drops. Host network problems can severely degrade network performance. However, when host network problems occur, they can hardly be noticed due to the lack of a monitoring system. Furthermore, existing diagnostic mechanisms cannot efficiently diagnose host network problems. In this paper, we analyze the symptom of host network problems based on our long-term troubleshooting experience and propose Hostping, the first monitoring and diagnostic system dedicated to host networks. The core idea of Hostping is to conduct 1) loopback tests between RNICs and endpoints within the host to measure intra-host latency and bandwidth, and 2) mutual probing between RNICs on a host to measure RNIC connectivity. We have deployed Hostping on thousands of servers in our distributed machine learning system. Not only can Hostping detect and diagnose host network problems we already knew in minutes, but it also reveals eight problems we did not notice before.
诊断 RDMA 服务器的终端主机网络瓶颈
在 RDMA(远程直接内存访问)网络中,包括主机内网络和 RNIC(RDMA 网卡)在内的终端主机网络被认为是稳健的,很少受到关注。然而,随着 RNIC 线路速率迅速增至数百千兆比特,主机内网络成为网络应用的潜在性能瓶颈。主机内网络瓶颈会导致主机内带宽下降和主机内延迟增加。此外,RNIC 网络问题还可能导致连接失败和数据包丢失。主机网络问题会严重降低网络性能。然而,当主机网络出现问题时,由于缺乏监控系统,这些问题很难被察觉。此外,现有的诊断机制也无法有效诊断主机网络问题。在本文中,我们根据长期的故障诊断经验,分析了主机网络问题的症状,并提出了首个专用于主机网络的监控和诊断系统 Hostping。Hostping的核心思想是:1)在主机内的RNIC和端点之间进行环回测试,以测量主机内的延迟和带宽;2)在主机上的RNIC之间进行相互探测,以测量RNIC的连接性。我们已经在分布式机器学习系统的数千台服务器上部署了Hostping。Hostping 不仅能在几分钟内检测和诊断出我们已经知道的主机网络问题,而且还能发现我们以前没有注意到的八个问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE/ACM Transactions on Networking
IEEE/ACM Transactions on Networking 工程技术-电信学
CiteScore
8.20
自引率
5.40%
发文量
246
审稿时长
4-8 weeks
期刊介绍: The IEEE/ACM Transactions on Networking’s high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信