Performance Improvement of a Scalable High-Order Compressible Flow Solver on Unstructured Hexahedral Grids
Authors: Kazuma Tago, T. Haga, S. Tsutsumi, R. Takaki
Published in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020-01-15
DOI: 10.1145/3368474.3368480 (https://doi.org/10.1145/3368474.3368480)
This paper describes LS-FLOW-HO, a high-order compressible flow solver based on the Flux Reconstruction (FR) method, and its performance optimization. The FR method achieves arbitrarily high-order accuracy on unstructured grids and is well suited to many-core architectures because the spatial discretization uses local data sets (stencils). This study focuses on performance optimization on the PRIMEHPC FX100, a Fujitsu scalar supercomputer. First, the execution time of sample code that uses the BLAS library is compared with that of code that uses a sparse matrix multiplication, which computes only the non-zero values. The sparse matrix multiplication is found to take less time than DGEMM for hexahedral elements when the degree of the interpolation polynomial is higher than 2. Using sparse matrix multiplication, hot-spot tuning was then carried out by extracting each subroutine from LS-FLOW-HO. Speedup was confirmed from three changes: restructuring the arrays at the cell boundaries, reducing memory/cache access latency through sequential memory access, and increasing loop length through loop collapsing. Applying these tunings to LS-FLOW-HO reduced execution time by up to 40%, and performance reached 10.23% of the theoretical peak FLOPS using 16 OpenMP threads on a single node. On Intel Haswell, execution time was reduced by about 49%, confirming that the proposed techniques are also effective on other processors. Finally, sustained strong-scaling performance for a real application to supersonic jets is demonstrated using 32 to 3,200 nodes (1,024 to 102,400 cores).
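The comparison between DGEMM and a multiplication that touches only non-zero values can be illustrated with a small sketch. The abstract does not give the solver's operators, so the following assumes a tensor-product hexahedral element with n = p + 1 points per direction and a one-dimensional n-by-n matrix `D` applied along one direction. All function names and the point ordering are hypothetical and are not taken from LS-FLOW-HO. Under these assumptions the full operator is an N-by-N matrix (N = n^3) with only n non-zero entries per row.

```python
# Illustrative sketch only: names, point ordering, and the operator are
# assumptions, not the LS-FLOW-HO implementation.

def dense_xi_operator(D):
    """Expand a 1D n-by-n matrix D into the N-by-N operator (N = n**3) that
    acts along the first direction of a tensor-product hexahedral element.
    Points are numbered i + n*(j + n*k)."""
    n = len(D)
    N = n ** 3
    A = [[0.0] * N for _ in range(N)]
    for k in range(n):
        for j in range(n):
            base = n * (j + n * k)
            for i in range(n):
                for m in range(n):
                    A[base + i][base + m] = D[i][m]
    return A


def apply_dense(A, u):
    """Dense product in the style of a general matrix multiply: every entry
    of A is used, including the structural zeros.
    Returns the result and the number of multiply-adds performed."""
    out = []
    madds = 0
    for row in A:
        acc = 0.0
        for a, x in zip(row, u):
            acc += a * x
            madds += 1
        out.append(acc)
    return out, madds


def apply_sparse(D, u):
    """Same product, but only the n structurally non-zero entries of each
    row are touched. Returns the result and the number of multiply-adds."""
    n = len(D)
    out = [0.0] * (n ** 3)
    madds = 0
    for k in range(n):
        for j in range(n):
            base = n * (j + n * k)
            for i in range(n):
                acc = 0.0
                for m in range(n):
                    acc += D[i][m] * u[base + m]
                    madds += 1
                out[base + i] = acc
    return out, madds
```

For polynomial degree 3 (n = 4, N = 64), the dense product performs 64 × 64 = 4096 multiply-adds, while the sparse version performs 64 × 4 = 256. Operation counts alone do not determine run time, since an optimized DGEMM can offset extra arithmetic with better hardware utilization. This is consistent with the abstract reporting an advantage for the sparse multiplication only above degree 2.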