Evaluation of Splitting-Up Conjugate Gradient Method on GPUs
A. Wakatani
2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), published 2016-04-04
DOI: 10.1109/PDP.2016.9 (https://doi.org/10.1109/PDP.2016.9)
This paper describes the implementation of a preconditioned CG (Conjugate Gradient) method on GPUs and evaluates its performance against CPUs. Our CG method uses the SP (Splitting-Up) preconditioner, which is well suited to parallel processing because every dimension except one can be processed independently. To make better use of the bandwidth of the GPU's global memory, our implementation applies a pseudo matrix transposition before and after the tridiagonal matrix solver, which results in coalesced memory accesses. In addition, the number of pseudo matrix transpositions can be reduced to just one by using a rotation configuration technique. With these techniques, the speedup of our approach improves by up to 102.2%.
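The layout idea behind the transposition step can be illustrated with a small CPU-side sketch. This is not the paper's implementation: the abstract does not specify how the pseudo transposition or the rotation configuration work, so an ordinary transposition stands in for them here, and the function names (`thomas_solve`, `transpose_flat`, `solve_batch_interleaved`) are hypothetical. The sketch assumes that all systems share one diagonally dominant tridiagonal matrix and that the solver is the standard Thomas algorithm.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    lower[0] and upper[-1] are not part of the matrix. No pivoting is
    done, so the matrix is assumed to be diagonally dominant.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward elimination.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def transpose_flat(flat, rows, cols):
    """Transpose a row-major rows x cols array; the result is row-major too."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]


def solve_batch_interleaved(lower, diag, upper, rhs_t, n, m):
    """Solve m independent systems of size n stored in interleaved layout.

    rhs_t[i * m + j] holds entry i of system j. On a GPU, thread j would
    handle system j, so at step i neighbouring threads touch neighbouring
    addresses, which is the access pattern that coalesces.
    """
    out = [0.0] * (n * m)
    for j in range(m):  # stands in for one GPU thread per system
        column = [rhs_t[i * m + j] for i in range(n)]
        x = thomas_solve(lower, diag, upper, column)
        for i in range(n):
            out[i * m + j] = x[i]
    return out


if __name__ == "__main__":
    n, m = 4, 3
    lower = [0.0, -1.0, -1.0, -1.0]
    diag = [4.0, 4.0, 4.0, 4.0]
    upper = [-1.0, -1.0, -1.0, 0.0]
    # System j occupies rhs[j * n:(j + 1) * n] (contiguous layout).
    rhs = [float(k + 1) for k in range(n * m)]
    rhs_t = transpose_flat(rhs, m, n)          # transpose before the solve
    sol_t = solve_batch_interleaved(lower, diag, upper, rhs_t, n, m)
    sol = transpose_flat(sol_t, n, m)          # transpose back afterwards
    print(sol)
```

In the contiguous layout, thread j would read address j * n + i at step i, so neighbouring threads are n elements apart. After the transposition they are one element apart. The sketch pays for two transpositions per solve, which is the cost the abstract says the rotation configuration technique reduces to one.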