High-Performance PDES on Manycore Clusters
B. Williams, Ali Eker, K. Chiu, D. Ponomarev
Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
Published: 2021-05-21
DOI: 10.1145/3437959.3459252 (https://doi.org/10.1145/3437959.3459252)
Citations: 3
Abstract
Performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grain communication, especially in execution environments with high communication cost. Low latencies of on-chip communication in emerging manycore processors promise to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing faster on-chip communication with slower traditional network communication between cluster nodes. In this work, we investigate the performance of PDES on a cluster of Intel's Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered around the use of atomic compare-and-swap operations to reduce synchronization overhead between a dedicated communication thread and computation threads; (2) careful selection of the number of computation threads per communication thread to limit the pressure on each communication thread; and (3) balancing the timing of communication and computation threads to ensure their synchronized forward progress. Combined, these optimizations result in a 2x to 16x speedup over baseline implementations in the ROSS simulator.
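
The abstract does not describe the buffer layout, so the following is only a minimal sketch of the general idea behind optimization (1): several computation threads hand events to one dedicated communication thread through a shared buffer, claiming slots with an atomic compare-and-swap instead of taking a lock. All names here (`Event`, `CasBuffer`, `try_push`, `drain`, `run_demo`) and the per-slot state scheme are assumptions made for illustration; they are not taken from the paper or from ROSS.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical event payload; the real ROSS message layout is not shown in the source.
struct Event {
    std::uint32_t sender;
    std::uint32_t seq;
};

// Per-slot states (an assumed scheme, not the paper's design).
constexpr int kEmpty = 0;    // free for any computation thread to claim
constexpr int kWriting = 1;  // claimed; payload is being written
constexpr int kFull = 2;     // payload ready for the communication thread

struct Slot {
    std::atomic<int> state{kEmpty};
    Event event{};
};

template <std::size_t N>
class CasBuffer {
public:
    // Called by computation threads. Returns false if no slot is currently free.
    bool try_push(const Event& e) {
        for (auto& slot : slots_) {
            int expected = kEmpty;
            // Only one thread can win the kEmpty -> kWriting transition.
            if (slot.state.compare_exchange_strong(expected, kWriting,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
                slot.event = e;
                // Publish the payload to the communication thread.
                slot.state.store(kFull, std::memory_order_release);
                return true;
            }
        }
        return false;
    }

    // Called by the single communication thread. Returns the number of events drained.
    template <typename F>
    std::size_t drain(F&& send) {
        std::size_t n = 0;
        for (auto& slot : slots_) {
            if (slot.state.load(std::memory_order_acquire) == kFull) {
                send(slot.event);
                // Hand the slot back to the computation threads.
                slot.state.store(kEmpty, std::memory_order_release);
                ++n;
            }
        }
        return n;
    }

private:
    std::array<Slot, N> slots_;
};

// Demo: `producers` computation threads each push `per_thread` events while one
// communication thread drains them. Returns the number of events received.
std::size_t run_demo(unsigned producers, unsigned per_thread) {
    CasBuffer<64> buf;
    const std::size_t total = static_cast<std::size_t>(producers) * per_thread;
    std::size_t received = 0;  // touched only by the communication thread until join

    std::thread comm([&] {
        while (received < total) {
            // A real communication thread would pass each event to the network here.
            received += buf.drain([](const Event&) {});
            std::this_thread::yield();
        }
    });

    std::vector<std::thread> workers;
    for (unsigned p = 0; p < producers; ++p) {
        workers.emplace_back([&buf, p, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i) {
                while (!buf.try_push(Event{p, i})) std::this_thread::yield();
            }
        });
    }
    for (auto& w : workers) w.join();
    comm.join();
    return received;
}
```

In this sketch, calling `run_demo(4, 1000)` starts four computation threads and one communication thread. The ratio of computation threads to communication threads is the quantity that optimization (2) tunes. The sketch does not preserve event order across slots and makes no attempt at the timing balance described in optimization (3).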