Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

2010 IEEE 16th International Conference on Parallel and Distributed Systems
Pub Date: 2010-12-08
DOI: 10.1109/ICPADS.2010.64

Xiang Cui, Yifeng Chen, Changyou Zhang, Hong Mei


Citations: 21


Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

In this paper we discuss our experience in improving the performance of GEMM (both single and double precision) on the Fermi architecture using CUDA, and how new features of Fermi, such as the cache, affect performance. We find that adding a cache to the GPU helps the processors exploit data locality that arises at runtime, but it also makes the dependence of performance on algorithmic parameters less predictable. Auto-tuning then becomes a useful technique for addressing this issue. Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050. The design and implementation use only CUDA and C and have not benefited from tuning at the level of binary code.
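The abstract does not describe the tuning procedure itself, so the following is only a minimal sketch of the general idea of empirical auto-tuning: run the same blocked matrix multiplication with several candidate tile sizes, time each one, and keep the fastest. Everything here is an assumption made for illustration: it is a CPU-side Python stand-in rather than the authors' CUDA kernels, and the names `blocked_matmul` and `autotune`, the candidate tile sizes, and the exhaustive search strategy are all hypothetical.

```python
import random
import time


def blocked_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Work on one tile; min() handles n not divisible by tile.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time every candidate tile size and return (best_tile, timings).

    Hypothetical exhaustive search; the paper's search space and
    strategy are not given in the abstract.
    """
    rng = random.Random(0)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, tile)
            # Keep the minimum over repeats to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    n = 48
    best_tile, timings = autotune(n, candidates=[4, 8, 16, 48])
    for tile, seconds in sorted(timings.items()):
        # A dense n x n GEMM performs 2 * n^3 floating-point operations.
        gflops = 2.0 * n ** 3 / seconds / 1e9
        print(f"tile={tile:3d}  {seconds * 1e3:8.2f} ms  {gflops:.4f} GFlops")
    print("selected tile:", best_tile)
```

The selected tile size may differ between machines and even between runs, which is the point made in the abstract: when a cache makes performance hard to predict from the parameters alone, measuring each configuration is a practical way to choose among them.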
