Optimizing the LULESH stencil code using concurrent collections
Authors: Chenyang Liu, Milind Kulkarni
DOI: 10.1145/2830018.2830024 (https://doi.org/10.1145/2830018.2830024)
Journal: 高性能计算技术, vol. 15, no. 1
Publication date: 2015-11-15
Citations: 3
Abstract
Writing scientific applications for modern multicore machines is a challenging task. A myriad of hardware solutions are available for many different target applications, each with its own advantages and trade-offs. An attractive approach is Concurrent Collections (CnC), a programming model that separates the concerns of the application expert from those of the performance expert. CnC uses a data- and control-flow model that draws on earlier data-flow programming models and on tuple spaces. When a program follows the CnC paradigm, the runtime seamlessly exploits the available parallelism regardless of the platform; however, how effective this is depends on the algorithm. In this paper, we explore ways to optimize the performance of the proxy application Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH), written using Concurrent Collections. The LULESH algorithm is expressed as a minimally constrained set of partially ordered operations with explicit dependencies. However, performance is plagued by scheduling overhead and synchronization costs caused by the fine granularity of the computation steps. For LULESH and similar stencil codes, we show that an algorithmic CnC program can be turned into a well-tuned, scalable application on multicore systems by coalescing CnC elements through step fusion and tiling. With these optimizations, we achieve up to 38x speedup over the original implementation, with good scalability on machines with up to 48 processors.
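The effect of step fusion and tiling on the number of scheduled tasks can be illustrated with a minimal sketch. This is not the authors' CnC code and does not use the CnC runtime: it uses a plain Python thread pool, and the functions `scale`, `shift`, `fine_grained`, `fused_tile`, and `fused_tiled` are hypothetical names chosen for illustration. The two element-wise steps stand in for LULESH's computation steps; a real stencil step would also need neighbor (halo) values for each tile.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """Hypothetical step A: a tiny per-element computation."""
    return 2.0 * x


def shift(x):
    """Hypothetical step B: consumes the output of step A."""
    return x + 1.0


def fine_grained(data, pool):
    """Schedule one task per element per step (2 * len(data) tasks)."""
    a = list(pool.map(scale, data))
    b = list(pool.map(shift, a))
    return b, 2 * len(data)


def fused_tile(tile):
    """Fusion: run step A and step B back to back inside one task."""
    return [shift(scale(x)) for x in tile]


def fused_tiled(data, pool, tile_size):
    """Tiling: schedule one fused task per tile of elements."""
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    out = [y for part in pool.map(fused_tile, tiles) for y in part]
    return out, len(tiles)


if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        fine, fine_tasks = fine_grained(data, pool)
        coarse, coarse_tasks = fused_tiled(data, pool, tile_size=64)
    print(fine == coarse, fine_tasks, coarse_tasks)
```

Both variants compute the same values; the fused and tiled variant simply hands the scheduler far fewer, larger tasks, which is the kind of coarsening the abstract credits for reducing scheduling and synchronization overhead. The tile size of 64 is an arbitrary assumption, not a value from the paper.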