Malek Musleh, Roberto Peñaranda, Allister Alemania, P. Yébenes, Gene Y. Wu, Jan Zielinski, K. Raszkowski, N. Ni, Scott Diesing, Anupama Kurpad, R. Huggahalli, Curt E. Bruns, Steven Miller, Sujoy Sen
2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). Published 2020-11-17. DOI: 10.1109/MASCOTS50786.2020.9285933 (https://doi.org/10.1109/MASCOTS50786.2020.9285933)
Fabsim-X: A Simulation Framework for the Analysis of Large-Scale Topologies and Congestion Control Protocols in Data Center Networks
The explosive growth of cloud computing and of data center systems overall has placed unprecedented demands on system architects and designers, who must continuously develop more complex system networks to satisfy the insatiable appetite to process, move, and store large amounts of data. Nonlinear system behavior caused by emerging workloads and use cases, varying end-to-end congestion protocols, and heterogeneity in the compute and storage capabilities of custom-designed accelerators further compounds the design problem. Modern simulation methodologies lack a cohesive and efficient framework to address the interoperability of the intersecting layers at scale. In this paper, we present a simulation framework for evaluating congestion control protocols. Furthermore, we present a set of optimizations that enable analysis over longer simulated times and at network scales of up to 128K nodes, which is vital for proper analysis of workloads that require long run times (e.g., AI training) or that are known to have scaling issues (e.g., RDMA). Specifically, we evaluate congestion control performance at various scales and study both the implications of topology scaling on congestion and the performance impact of simultaneous heterogeneous protocols.