Tuning applications for efficient GPU offloading to in-memory processing

Yudong Wu, Mingyao Shen, Yi-Hui Chen, Yuanyuan Zhou
{"title":"Tuning applications for efficient GPU offloading to in-memory processing","authors":"Yudong Wu, Mingyao Shen, Yi-Hui Chen, Yuanyuan Zhou","doi":"10.1145/3392717.3392760","DOIUrl":null,"url":null,"abstract":"Data movement between processors and main memory is a critical bottleneck for data-intensive applications. This problem is more severe with Graphics Processing Units (GPUs) applications due to their massive parallel data processing characteristics. Recent research has shown that in-memory processing can greatly alleviate this data movement bottleneck by reducing traffic between GPUs and memory devices. It offloads execution to in-memory processors, and avoids transferring enormous data between memory devices and processors. However, while in-memory processing is promising, to fully take advantage of such architecture, we need to solve several issues. For example, the conventional GPU application code that is highly optimized for the locality to execute efficiently in GPU does not necessarily have good locality for in-memory processing. As such, the GPU may mistakenly offload application routines that cannot gain benefit from in-memory processing. Additionally, workload balancing cannot simply treat in-memory processors as GPU processors since its data transfer time can be significantly reduced. Finally, how to offload application routines that access the shared memory inside GPUs is still an unsolved issue. In this paper, we explore four optimizations for GPU applications to take advantage of in-memory processors. Specifically, we propose four optimizations: application restructuring, run-time adaptation, aggressive loop offloading, and shared-memory transfer on-demand to mitigate the four unsolved issues in the GPU in-memory processing system. From our experimental evaluations with 13 applications, our approach can achieve 2.23x offloading performance improvement.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th ACM International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3392717.3392760","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Data movement between processors and main memory is a critical bottleneck for data-intensive applications. This problem is more severe for Graphics Processing Unit (GPU) applications due to their massively parallel data processing characteristics. Recent research has shown that in-memory processing can greatly alleviate this data movement bottleneck by reducing traffic between GPUs and memory devices: it offloads execution to in-memory processors and avoids transferring enormous amounts of data between memory devices and processors. However, while in-memory processing is promising, several issues must be solved to fully take advantage of such an architecture. For example, conventional GPU application code that is highly optimized for locality to execute efficiently on the GPU does not necessarily have good locality for in-memory processing. As a result, the GPU may mistakenly offload application routines that cannot benefit from in-memory processing. Additionally, workload balancing cannot simply treat in-memory processors as GPU processors, since their data transfer time can be significantly reduced. Finally, how to offload application routines that access the shared memory inside GPUs remains an unsolved issue. In this paper, we explore four optimizations that let GPU applications take advantage of in-memory processors: application restructuring, run-time adaptation, aggressive loop offloading, and shared-memory transfer on-demand, which together mitigate the four unsolved issues in a GPU in-memory processing system. In experimental evaluations with 13 applications, our approach achieves a 2.23x offloading performance improvement.
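To make the run-time adaptation idea concrete, the sketch below shows one way a runtime could decide, per routine, whether offloading to an in-memory processor (PIM) is likely to pay off. This is a minimal illustration under stated assumptions, not the paper's implementation: the `KernelProfile` struct, the `should_offload_to_pim` helper, the bandwidth and throughput constants, and the roofline-style cost model are all hypothetical names and numbers introduced for this example.

```cpp
// Minimal sketch (not the paper's implementation) of a run-time
// offload-decision heuristic: offload a routine to the in-memory
// processor only when the estimated PIM time beats the GPU time.
#include <cstdio>

struct KernelProfile {
    double bytes_moved; // memory traffic the routine would generate (bytes)
    double flops;       // arithmetic work in the routine (floating-point ops)
};

// Hypothetical device parameters, loosely modeling the trade-off the
// abstract describes: PIM sees much higher internal memory bandwidth
// but much lower compute throughput than the GPU.
constexpr double GPU_BW_GBS = 900.0;   // GPU <-> memory bandwidth (assumed)
constexpr double PIM_BW_GBS = 3600.0;  // in-stack memory bandwidth (assumed)
constexpr double GPU_GFLOPS = 14000.0; // GPU compute throughput (assumed)
constexpr double PIM_GFLOPS = 1000.0;  // PIM compute throughput (assumed)

// Roofline-style estimate: execution time is bounded by whichever is
// slower, moving the data or doing the arithmetic.
static double estimate_seconds(const KernelProfile& k,
                               double bw_gbs, double gflops) {
    double transfer = k.bytes_moved / (bw_gbs * 1e9);
    double compute  = k.flops / (gflops * 1e9);
    return transfer > compute ? transfer : compute;
}

// Run-time adaptation: pick the target with the lower estimated time.
bool should_offload_to_pim(const KernelProfile& k) {
    return estimate_seconds(k, PIM_BW_GBS, PIM_GFLOPS) <
           estimate_seconds(k, GPU_BW_GBS, GPU_GFLOPS);
}

int main() {
    KernelProfile streaming{1e9, 1e8};      // 1 GB moved, 0.1 GFLOP: traffic-bound
    KernelProfile compute_bound{1e6, 1e11}; // 1 MB moved, 100 GFLOP: compute-bound
    std::printf("streaming kernel     -> %s\n",
                should_offload_to_pim(streaming) ? "PIM" : "GPU");
    std::printf("compute-bound kernel -> %s\n",
                should_offload_to_pim(compute_bound) ? "PIM" : "GPU");
}
```

Note that such a cost model also reflects the workload-balancing point in the abstract: because an offloaded routine largely avoids the GPU-to-memory transfer cost, a scheduler cannot treat PIM cores as just additional GPU cores when dividing work between them.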