Analyzing Cost-Performance Tradeoffs of HPC Network Designs under Different Constraints using Simulations
A. Bhatele, Nikhil Jain, M. Mubarak, T. Gamblin
Proceedings of the 2019 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2019-05-29
DOI: 10.1145/3316480.3325516
Citation count: 8
Abstract
Identifying a suitable network topology and deciding its optimal configuration parameters are critical aspects of the overall HPC system design, procurement, and installation process. Typically, multiple network topology choices are compared under the balanced injection-to-global bandwidth criterion to identify the best candidate. However, deviating from this balanced criterion may not adversely impact application performance, and it is often done in practice for other reasons, such as monetary cost. In this paper, we identify different practical constraints that determine the number of nodes, routers, and links, and that in turn influence dollar costs and impact network design. We design network topologies under one or more such constraints, yielding different design points (iso-* analysis). We then perform a comprehensive, comparative evaluation of three scalable network topologies (dragonfly, express mesh, and fat-tree), enabled by parallel discrete-event simulations (PDES) of relevant HPC workloads. We identify the network topologies that perform best under different iso-* configurations and compare their performance per dollar based on market data.
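The balanced injection-to-global bandwidth criterion can be made concrete for the dragonfly. The following is a minimal sketch, not the authors' method or sizing rules: it assumes the canonical dragonfly of Kim et al. (2008), with p nodes per router, a routers per group, h global links per router, the maximum of a*h + 1 groups, and identical bandwidth on every link. Under those assumptions the injection-to-global bandwidth ratio reduces to p/h, and the commonly recommended a = 2p = 2h configuration gives a ratio of 1. The `Dragonfly` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    """Hypothetical model of a canonical dragonfly (Kim et al., 2008).

    p: compute nodes (terminals) per router
    a: routers per group
    h: global links per router
    Assumes every link has the same bandwidth and the maximum group count.
    """
    p: int
    a: int
    h: int

    @property
    def groups(self) -> int:
        # Maximum size: each group has one global link to every other group.
        return self.a * self.h + 1

    @property
    def routers(self) -> int:
        return self.a * self.groups

    @property
    def nodes(self) -> int:
        return self.p * self.routers

    @property
    def global_links(self) -> int:
        # Each bidirectional global link uses one port on each of two routers,
        # so halve the total global port count (always even at maximum size).
        return self.routers * self.h // 2

    def injection_to_global_ratio(self) -> float:
        # With uniform link bandwidth, total injection bandwidth is
        # routers * p * bw and total global port bandwidth is routers * h * bw.
        return self.p / self.h

    def is_balanced(self) -> bool:
        # Recommended dragonfly balance rule: a = 2p = 2h.
        return self.a == 2 * self.p == 2 * self.h


if __name__ == "__main__":
    balanced = Dragonfly(p=4, a=8, h=4)
    tapered = Dragonfly(p=4, a=8, h=2)  # fewer global ports per router
    for df in (balanced, tapered):
        print(df.nodes, df.global_links, df.injection_to_global_ratio(), df.is_balanced())
    # prints:
    # 1056 528 1.0 True
    # 544 136 2.0 False
```

Halving h in this sketch gives a tapered network (ratio 2.0) with far fewer global links, which is the kind of cost-motivated deviation from the balanced criterion that the abstract describes. Because the group count is tied to a*h + 1 here, the tapered example is also smaller; an iso-node comparison would need a different sizing rule.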