Improving Congestion Control through Fine-Grain Monitoring of InfiniBand Networks

Alberto Cascajo, Gabriel Gomez-Lopez, J. Escudero-Sahuquillo, P. García, D. E. Singh, Francisco J. Alfaro-Cortés, F. Quiles, J. Carretero
Published in: 2022 IEEE Symposium on High-Performance Interconnects (HOTI), 2022-08-01
DOI: 10.1109/HOTI55740.2022.00020 (https://doi.org/10.1109/HOTI55740.2022.00020)

Abstract

Congestion is a serious threat to the performance of interconnection networks in High-Performance Computing and Data-Center systems. Hence, the specifications of the main interconnect technologies, such as InfiniBand, define mechanisms to deal with congestion and its effects. However, these standard mechanisms may not detect or track the actual state of network congestion accurately, because congestion dynamics can be complex and varied. Moreover, finding an optimal configuration of the parameters that drive congestion-control mechanisms is often difficult, as a configuration that suits some traffic scenarios may not suit others. In this paper, we propose combining an existing lightweight platform monitoring tool (LIMITLESS) with the InfiniBand control software (OpenSM), so that the communication-volume metrics provided by the former give the latter a more precise picture of congestion status and let it react more efficiently. The main contributions of this paper are a methodology to link the monitor and OpenSM, and modifications to the standard InfiniBand congestion-control mechanism so that its reaction is modulated by the enhanced knowledge about congestion that the monitor provides. These improvements are ready to be integrated into any InfiniBand-based system. According to our experiments (performed on a real InfiniBand-based cluster running a widely used benchmark), the proposed approach significantly reduces the number of false congestion detections, and therefore the number of times the congestion-control mechanisms react unnecessarily, improving system performance by up to 74%. The overhead of the monitoring tool is 0.1% in our experiments, with data collected every 200 ms.
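
The idea of using monitored communication volumes to filter out false congestion detections can be illustrated with a minimal sketch. This is not the authors' implementation and does not use any LIMITLESS or OpenSM API: the `LinkSample` structure, the `should_react` function, the 100 Gb/s link capacity, and the 0.8 utilization threshold are all hypothetical choices made for illustration. Only the 200 ms sampling interval comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    """One monitoring sample for a link (hypothetical structure)."""
    bytes_sent: int       # bytes observed during the sampling interval
    interval_s: float     # sampling interval in seconds (200 ms in the abstract)
    capacity_bps: float   # nominal link capacity in bits/s (assumed known)

    def utilization(self) -> float:
        """Fraction of the link capacity used during the interval."""
        return (self.bytes_sent * 8) / (self.capacity_bps * self.interval_s)


def should_react(congestion_flagged: bool, sample: LinkSample,
                 utilization_threshold: float = 0.8) -> bool:
    """Confirm a congestion flag only if the monitored volume supports it.

    The 0.8 threshold is an illustrative assumption, not a value from the paper.
    """
    return congestion_flagged and sample.utilization() >= utilization_threshold


# Two samples on an assumed 100 Gb/s link, both flagged as congested.
idle = LinkSample(bytes_sent=1_000_000, interval_s=0.2, capacity_bps=100e9)
busy = LinkSample(bytes_sent=2_400_000_000, interval_s=0.2, capacity_bps=100e9)

print(should_react(True, idle))  # low traffic volume: flag treated as a false detection
print(should_react(True, busy))  # high traffic volume: flag confirmed
```

In this sketch, a congestion flag on a lightly loaded link is suppressed, so the congestion-control reaction would be triggered only when the monitored traffic volume is consistent with real congestion. How the paper actually modulates the reaction may differ.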