三角矩阵反演图形处理单元

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis Pub Date : 2009-11-14 DOI:10.1145/1654059.1654069

F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri

{"title":"三角矩阵反演图形处理单元","authors":"F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri","doi":"10.1145/1654059.1654069","DOIUrl":null,"url":null,"abstract":"Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"33","resultStr":"{\"title\":\"Triangular matrix inversion on Graphics Processing Unit\",\"authors\":\"F. Ries, T. DeMarco, Matteo Zivieri, R. Guerrieri\",\"doi\":\"10.1145/1654059.1654069\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.\",\"PeriodicalId\":371415,\"journal\":{\"name\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"33\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1654059.1654069\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1654059.1654069","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 33

摘要

密集矩阵反演是许多线性代数算法中的一个基本过程。在大多数密集矩阵反演方法中，计算难度很大的一个步骤是由LU分解等因式分解方法产生的三角矩阵的反演。在本文中，我们演示了如何通过在标准PC中使用商用图形处理单元(GPU)来大大加速三角矩阵反演(TMI)。我们的实现是基于分而治之的递归TMI算法，有效地适应GPU架构。与基于cpu的LAPACK参考例程相比，我们的实现获得了34倍的加速，并且在GTX 280上以双精度运行高达54 gigaflops/s。讨论了该算法的局限性，并介绍了相应的解决策略。此外，我们还展示了如何在基于GTX 295的双gpu系统上以高达90 gigaflops/s的速度同时执行L矩阵和u矩阵的反转。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Triangular matrix inversion on Graphics Processing Unit

Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture. Our implementation obtains a speedup of 34x versus a CPU-based LAPACK reference routine, and runs at up to 54 gigaflops/s on a GTX 280 in double precision. Limitations of the algorithm are discussed, and strategies to cope with them are introduced. In addition, we show how inversion of an L- and U-matrix can be performed concurrently on a GTX 295 based dual-GPU system at up to 90 gigaflops/s.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

自引率

0.00%

发文量