Evaluation of Splitting-Up Conjugate Gradient Method on GPUs

2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). Pub Date: 2016-04-04. DOI: 10.1109/PDP.2016.9

A. Wakatani


Citations: 0

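Abstract

This paper describes the implementation of a preconditioned conjugate gradient (CG) method on GPUs and evaluates its performance against CPUs. Our CG method uses the splitting-up (SP) preconditioner, which is well suited to parallel processing because all dimensions except one are independent. To make better use of the bandwidth to GPU global memory, our implementation performs a pseudo matrix transposition before and after the tridiagonal matrix solver, which results in coalesced memory accesses. In addition, the number of pseudo matrix transpositions can be reduced to only one by using a rotation configuration technique. With these techniques, the speedup of our approach is improved by up to 102.2%.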
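The abstract gives no code, so the following is only a minimal sketch of the general idea, not the author's implementation. It assumes the tridiagonal solves use the standard Thomas algorithm and that the effect of the transposition is an interleaved layout in which entry k of system j sits at index `k * batch + j`. With that layout, neighbouring systems (one per GPU thread in a real kernel) touch neighbouring addresses at every step. The names `to_interleaved` and `thomas_batched` are hypothetical, and the sequential loop over `j` stands in for the parallel threads.

```python
def to_interleaved(rows):
    """Transpose per-system rows into a flat interleaved layout.

    rows[j][k] (entry k of system j) is stored at index k * batch + j.
    """
    batch, n = len(rows), len(rows[0])
    return [rows[j][k] for k in range(n) for j in range(batch)]


def thomas_batched(lower, diag, upper, rhs, n, batch):
    """Solve `batch` independent tridiagonal systems of size n.

    All inputs are flat lists in the interleaved layout described above.
    The first sub-diagonal entry and the last super-diagonal entry of
    each system are never used in the result.
    """
    c = [0.0] * (n * batch)  # modified super-diagonal
    x = [0.0] * (n * batch)  # modified right-hand side, then the solution

    # Each value of j is an independent system. On a GPU this loop would
    # be the thread index; here it runs sequentially for illustration.
    for j in range(batch):
        # Forward elimination along the one dependent dimension.
        c[j] = upper[j] / diag[j]
        x[j] = rhs[j] / diag[j]
        for k in range(1, n):
            i = k * batch + j      # entry k of system j
            p = i - batch          # entry k - 1 of the same system
            denom = diag[i] - lower[i] * c[p]
            c[i] = upper[i] / denom
            x[i] = (rhs[i] - lower[i] * x[p]) / denom

        # Back substitution.
        for k in range(n - 2, -1, -1):
            i = k * batch + j
            x[i] -= c[i] * x[i + batch]
    return x
```

For a fixed `k`, the indices `k * batch + j` are consecutive in `j`, which is the access pattern that coalesces on a GPU. A row-major layout (`j * n + k`) would instead give each thread a stride of `n` between neighbours.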



