重新审视时态阻塞模板优化

Proceedings of the 37th International Conference on Supercomputing Pub Date : 2023-05-12 DOI:10.1145/3577193.3593716

Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka

{"title":"重新审视时态阻塞模板优化","authors":"Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka","doi":"10.1145/3577193.3593716","DOIUrl":null,"url":null,"abstract":"Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Revisiting Temporal Blocking Stencil Optimizations\",\"authors\":\"Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka\",\"doi\":\"10.1145/3577193.3593716\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.\",\"PeriodicalId\":424155,\"journal\":{\"name\":\"Proceedings of the 37th International Conference on Supercomputing\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577193.3593716\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593716","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

迭代模板在高性能计算(HPC)应用中广泛使用。鉴于GPU加速的超级计算机的普及，许多努力已经投入到优化模板GPU内核。为了提高数据的局部性，时间阻塞是一种将一批时间步合并在一起处理的优化方法。在观察到gpu在某些方面正在演变成类似cpu的情况下，我们重新审视gpu的时间阻塞优化。我们探讨了时间阻塞方案如何适应最新Nvidia gpu的新功能，包括大刮擦板内存，硬件预取和设备范围同步。我们提出了一种新的时间阻塞方法，EBISU，它支持低设备占用来驱动大块上逐块执行的侵略性深度时间阻塞。我们将EBISU与最先进的时间块库进行比较:STENCILGEN和AN5D。我们还比较了配备了时间阻塞优化的最先进的模板自动调优工具:ARTEMIS和DRSTENCIL。在广泛的模具基准测试中，EBISU在每个模具基准测试中实现了高达2.53倍的加速和1.49倍的几何平均加速，达到了最先进的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Revisiting Temporal Blocking Stencil Optimizations

Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 37th International Conference on Supercomputing

自引率

0.00%

发文量