{"title":"融合CPU-GPU架构的cpu辅助GPGPU","authors":"Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou","doi":"10.1109/HPCA.2012.6168948","DOIUrl":null,"url":null,"abstract":"This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple thread-blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2-cache on the CPU side to increase the memory traffic from CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves the performance by up to 113% and 21.4% on average.","PeriodicalId":380383,"journal":{"name":"IEEE International Symposium on High-Performance Comp Architecture","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"90","resultStr":"{\"title\":\"CPU-assisted GPGPU on fused CPU-GPU architectures\",\"authors\":\"Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou\",\"doi\":\"10.1109/HPCA.2012.6168948\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple thread-blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2-cache on the CPU side to increase the memory traffic from CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. 
Abstract: This paper presents a novel approach that utilizes CPU resources to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program that is generated automatically from the GPU kernel by our proposed compiler algorithms and contains the memory access instructions of the GPU kernel for multiple thread blocks. The CPU pre-execution program runs ahead of the GPU threads because (1) it contains only the memory fetch instructions of the GPU kernel, not its floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the L2-cache prefetcher on the CPU side to increase the memory traffic issued by the CPU. As a result, the memory accesses of the GPU threads hit in the shared L3 cache and their latency is drastically reduced. Since our pre-execution is directly controlled by user-level applications, it offers both high accuracy and flexibility. Our experiments on a set of benchmarks show that the proposed pre-execution improves performance by up to 113%, and by 21.4% on average.
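To make the mechanism concrete, below is a minimal sketch of what a compiler-generated pre-execution slice might look like for a simple vector-add kernel. This is our own illustration, not code from the paper; identifiers such as vecAdd and preExecute are hypothetical, and the sketch assumes a fused architecture where the CPU can address the same physical memory the GPU reads. The CPU routine reproduces only the kernel's global-memory reads, one thread block at a time, so the corresponding cache lines are resident in the shared L3 before the GPU threads touch them.

    // Hypothetical sketch of the paper's CPU pre-execution idea (CUDA).
    #include <cuda_runtime.h>

    // Original GPU kernel: element-wise vector addition.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];  // two global loads, one FP add, one store
    }

    // CPU pre-execution slice, as the proposed compiler might emit it:
    // it keeps only the kernel's global-memory reads, drops the FP
    // computation and the store, and walks the data in thread-block
    // order so the fetched lines land in the shared L3 ahead of the GPU.
    static volatile float sink;  // defeats dead-load elimination

    void preExecute(const float *a, const float *b, int n, int blockSize) {
        for (int blk = 0; blk * blockSize < n; ++blk) {
            // Touch one word per 64-byte cache line, not every element.
            for (int t = 0; t < blockSize; t += 64 / sizeof(float)) {
                int i = blk * blockSize + t;
                if (i < n) {
                    sink = a[i];  // pull the 'a' stream into the shared L3
                    sink = b[i];  // pull the 'b' stream into the shared L3
                }
            }
            // The paper also throttles the CPU so it stays only a bounded
            // number of thread blocks ahead of the GPU's progress.
        }
    }

In the paper's scheme the host launches the GPU kernel first and then runs the pre-execution program on a CPU thread; touching one word per cache line rather than every element, together with the absence of any floating-point work, is what lets the CPU thread stay ahead of the GPU despite covering the same address stream.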