Register Caching for Stencil Computations on GPUs

2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Pub Date: 2014-09-01. DOI: 10.1109/SYNASC.2014.70

Thomas L. Falch, A. Elster


Citations: 14

Abstract

For most applications, taking full advantage of the memory system is key to achieving good performance on GPUs. In this paper, we introduce register caching, a novel idea in which the registers of multiple threads are combined and used as a shared, last-level, manually managed cache for the contributing threads. This method is enabled by the shuffle instruction recently introduced in Nvidia's Kepler GPU architecture, which allows threads in the same warp to exchange data directly, something that was previously possible only by going through shared memory. We evaluate our proposal with a stencil computation benchmark, achieving speedups of up to 2.04x compared to using shared memory on a GTX680 GPU. Stencil computations form the core of many scientific applications, which can therefore benefit from our proposal. Furthermore, our method is not limited to stencil computations, but is applicable to any application with a predictable memory access pattern suitable for manual caching.
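The idea of threads in a warp reading each other's registers can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions, not the authors' implementation: the 3-point averaging stencil, the one-element-per-lane layout, the overlapping-warp halo scheme, and all function names are hypothetical choices made for illustration. The helpers `shfl_up` and `shfl_down` mimic the semantics of the CUDA warp shuffle intrinsics (`__shfl_up_sync` and `__shfl_down_sync` in current CUDA), where a lane with no source lane simply keeps its own value.

```python
WARP_SIZE = 32  # number of lanes (threads) in one warp


def shfl_up(regs, delta):
    """Emulate shuffle-up: lane i reads the register of lane i - delta.

    Lanes without a source lane (i - delta < 0) keep their own value.
    """
    return [regs[i - delta] if i - delta >= 0 else regs[i]
            for i in range(len(regs))]


def shfl_down(regs, delta):
    """Emulate shuffle-down: lane i reads the register of lane i + delta.

    Lanes without a source lane (i + delta >= warp width) keep their own value.
    """
    n = len(regs)
    return [regs[i + delta] if i + delta < n else regs[i]
            for i in range(n)]


def stencil_register_cache(data):
    """3-point average of the interior of `data`; the two end points are copied.

    Assumption for this sketch: each emulated warp loads 32 consecutive
    elements, one per lane, and only the 30 interior lanes write a result.
    Consecutive warps therefore overlap by two elements (the halo).
    """
    n = len(data)
    out = list(data)
    interior = WARP_SIZE - 2  # lanes per warp that produce an output
    for base in range(0, n - 2, interior):
        # Each lane loads exactly one element into its "register"
        # (index clamped so the last warp never reads past the array).
        regs = [data[min(base + lane, n - 1)] for lane in range(WARP_SIZE)]
        # Neighbour values come from other lanes' registers, not from memory.
        left = shfl_up(regs, 1)
        right = shfl_down(regs, 1)
        for lane in range(1, WARP_SIZE - 1):
            i = base + lane
            if i < n - 1:  # skip lanes that fall on or past the last element
                out[i] = (left[lane] + regs[lane] + right[lane]) / 3.0
    return out


if __name__ == "__main__":
    samples = [float(i) for i in range(8)]
    print(stencil_register_cache(samples))
```

In this sketch every input element is loaded once per warp and then reused by up to three lanes through the emulated shuffles, which is the role shared memory would otherwise play. On a real GPU the per-lane lists would be ordinary registers and the two helpers would be single shuffle instructions.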
