Eman Bin Khunayn, Hairuo Xie, S. Karunasekera, K. Ramamohanarao
Dynamic Straggler Mitigation for Large-Scale Spatial Simulations
ACM Transactions on Spatial Algorithms and Systems, vol. 9, no. 1, pp. 1-34. Journal article, published 2023-01-06.
DOI: https://doi.org/10.1145/3578933
Citations: 1
Abstract
Spatial simulations are widely used to study real-world environments such as transportation systems. Applications such as transportation prediction and analysis require the simulation to handle millions of objects while running faster than real time. Running such large-scale simulations requires high computational power, which can be provided through parallel distributed computing. Implementations of parallel distributed spatial simulations usually follow a bulk synchronous parallel (BSP) model to ensure the correctness of the simulation. Processing in BSP is divided into iterations of computation and communication running on multiple workers, each followed by a global barrier synchronisation to ensure that all communications have concluded. Unfortunately, the BSP model is plagued by the straggler problem, where a delay in any worker slows down the entire simulation. Stragglers may occur for many reasons, including imbalanced workload distribution and communication or synchronisation delays. The straggler problem can become more severe as parallelism increases and as the workload distribution among workers changes continuously. This article proposes methods to dynamically mitigate stragglers and tackle communication delays. The proposed strategies can rebalance the workload distribution during simulation. These methods employ the spatial properties of the simulated environments to combine a flexible synchronisation model with decentralised dynamic load balancing and on-demand resource allocation. All proposed methods are implemented and evaluated using a microscopic traffic simulator as an example of large-scale spatial simulations. We run traffic simulations for Melbourne, Beijing and New York with different straggler scenarios. Our methods significantly improve simulation performance compared to advanced methods such as global dynamic load balancing.
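To make the straggler effect described in the abstract concrete, the following is a minimal toy model of BSP timing. It is only a sketch and is not the paper's method: the function names, the four-worker setup, and every timing value are hypothetical. It assumes that each iteration ends at a global barrier, so the iteration takes as long as its slowest worker, and that an idealised rebalancing can spread the same total work evenly at no cost.

```python
# Toy model of bulk synchronous parallel (BSP) timing; NOT the paper's method.
# Worker counts and per-iteration times are made-up illustrative values.

def superstep_time(compute_times, barrier_cost=0.0):
    """Wall-clock time of one BSP iteration: the global barrier releases
    only when the slowest worker has finished, so the maximum dominates."""
    return max(compute_times) + barrier_cost


def run_time(steps, barrier_cost=0.0):
    """Total wall-clock time over a sequence of BSP iterations."""
    return sum(superstep_time(s, barrier_cost) for s in steps)


# Ten iterations on four workers (hypothetical per-worker seconds).
balanced = [[1.0, 1.0, 1.0, 1.0]] * 10    # evenly loaded workers
straggler = [[1.0, 1.0, 1.0, 3.0]] * 10   # one worker is 3x slower

# Idealised rebalancing (assumption): same total work, split evenly,
# with no migration or communication overhead.
rebalanced = [[sum(s) / len(s)] * len(s) for s in straggler]

balanced_total = run_time(balanced)
straggler_total = run_time(straggler)
rebalanced_total = run_time(rebalanced)

print(balanced_total, straggler_total, rebalanced_total)
```

In this toy setting a single slow worker triples the run time even though the other three workers sit idle at the barrier, and an even split of the same work recovers half of that loss. The even split only helps when the delay comes from workload imbalance; a worker slowed by communication or synchronisation delays would need a different remedy.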
About the journal:
ACM Transactions on Spatial Algorithms and Systems (TSAS) is a scholarly journal that publishes the highest quality papers on all aspects of spatial algorithms and systems and closely related disciplines. It has a multi-disciplinary perspective in that it spans a large number of areas where spatial data is manipulated or visualized (regardless of how it is specified - i.e., geometrically or textually) such as geography, geographic information systems (GIS), geospatial and spatiotemporal databases, spatial and metric indexing, location-based services, web-based spatial applications, geographic information retrieval (GIR), spatial reasoning and mining, security and privacy, as well as the related visual computing areas of computer graphics, computer vision, geometric modeling, and visualization where the spatial, geospatial, and spatiotemporal data is central.