Register Caching for Stencil Computations on GPUs
Thomas L. Falch, A. Elster
2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, published 2014-09-01
DOI: 10.1109/SYNASC.2014.70 (https://doi.org/10.1109/SYNASC.2014.70)
For most applications, taking full advantage of the memory system is key to achieving good performance on GPUs. In this paper, we introduce register caching, a novel idea in which the registers of multiple threads are combined and used as a shared, last-level, manually managed cache for the contributing threads. This method is enabled by the shuffle instruction recently introduced in Nvidia's Kepler GPU architecture, which allows threads in the same warp to exchange data directly, something previously possible only by going through shared memory. We evaluate our proposal with a stencil computation benchmark, achieving speedups of up to 2.04x over a shared-memory implementation on a GTX680 GPU. Stencil computations form the core of many scientific applications, which can therefore benefit from our proposal. Furthermore, our method is not limited to stencil computations: it is applicable to any application with a predictable memory access pattern suitable for manual caching.
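
The abstract does not include the paper's kernels, so the following is only a CPU-side Python model of the idea, not the authors' implementation. It assumes a 1D three-point averaging stencil and a warp of 32 lanes. The helpers `shfl_up` and `shfl_down` are hypothetical stand-ins for the hardware shuffle instruction, and `stencil_warp` and `stencil_1d` are illustrative names. Each lane loads one input element into its "register" exactly once, and neighbouring values are then read from other lanes' registers rather than from shared memory.

```python
WARP_SIZE = 32  # lanes per warp on Nvidia GPUs


def shfl_up(regs, delta):
    """Model of shuffle-up: lane i reads the register of lane i - delta.

    Lanes without a source lane keep their own value.
    """
    return [regs[i - delta] if i - delta >= 0 else regs[i]
            for i in range(len(regs))]


def shfl_down(regs, delta):
    """Model of shuffle-down: lane i reads the register of lane i + delta."""
    return [regs[i + delta] if i + delta < len(regs) else regs[i]
            for i in range(len(regs))]


def stencil_warp(data, start):
    """One warp: a single load per lane, then neighbours come from registers."""
    # The only read from "global memory": one element per lane.
    regs = [data[start + lane] for lane in range(WARP_SIZE)]
    left = shfl_up(regs, 1)      # value held by the lane to the left
    right = shfl_down(regs, 1)   # value held by the lane to the right
    # Lanes 0 and 31 have no neighbour inside the warp, so only the 30
    # interior lanes produce an output.
    return [(left[l] + regs[l] + right[l]) / 3.0
            for l in range(1, WARP_SIZE - 1)]


def stencil_1d(data):
    """Apply the stencil with overlapping warps; end points are copied."""
    step = WARP_SIZE - 2  # consecutive warps overlap by two halo elements
    if len(data) < WARP_SIZE or (len(data) - 2) % step != 0:
        raise ValueError("this sketch needs len(data) == 2 + k * 30, k >= 1")
    out = list(data)
    for start in range(0, len(data) - WARP_SIZE + 1, step):
        out[start + 1:start + WARP_SIZE - 1] = stencil_warp(data, start)
    return out


if __name__ == "__main__":
    samples = [float(i % 5) for i in range(62)]
    print(stencil_1d(samples)[:4])
```

In this model the two lanes at the edge of each warp act purely as halo loaders, which is one simple way to handle warp boundaries; a real kernel could instead keep several elements per thread in registers to reduce that overhead.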