Fabsim-X: A Simulation Framework for the Analysis of Large-Scale Topologies and Congestion Control Protocols in Data Center Networks

Malek Musleh, Roberto Peñaranda, Allister Alemania, P. Yébenes, Gene Y. Wu, Jan Zielinski, K. Raszkowski, N. Ni, Scott Diesing, Anupama Kurpad, R. Huggahalli, Curt E. Bruns, Steven Miller, Sujoy Sen
Published in: 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Publication date: 2020-11-17
DOI: 10.1109/MASCOTS50786.2020.9285933 (https://doi.org/10.1109/MASCOTS50786.2020.9285933)
Citations: 0

Abstract

The explosive growth of cloud computing and of data center systems overall has placed unprecedented demand on system architects and designers to continuously develop more complex system networks that can satisfy the insatiable appetite to process, move, and store large amounts of data. Nonlinear system behavior caused by emerging workloads and use cases, varying end-to-end congestion protocols, and heterogeneity in the compute and storage capabilities of custom-designed accelerators further compounds the design problem. Modern simulation methodologies lack a cohesive and efficient framework to address the interoperability of the intersecting layers at scale. In this paper, we present a simulation framework for evaluating congestion control protocols. Furthermore, we present a set of optimizations that enable analysis over longer simulated times and at network scales of up to 128K nodes, which is vital for proper analysis of workloads that require long run times (e.g., AI training) or that are known to have scaling issues (e.g., RDMA). Specifically, we evaluate congestion control performance at various scales, study the implications of topology scaling on congestion, and examine the performance impact of simultaneous heterogeneous protocols.