Columnwise block LU factorization using BLAS kernels on VAX 6520/2VP

Paulo B. Vasconcelos, Filomena D. D'Almeida
{"title":"在VAX 6520/2VP上使用blas内核的列式块LU分解","authors":"Paulo B. Vasconcelos ,&nbsp;Filomena D. D'Almeida","doi":"10.1016/0956-0521(95)00049-6","DOIUrl":null,"url":null,"abstract":"<div><p>The LU factorization of a matrix <em>A</em> is a widely used algorithm, for instance in the solution of linear systems <em>Ax</em> = <em>b</em>. The increasing capacities of high performance computers allow us to use direct methods for systems of large and dense matrices. To build portable and efficient LU codes for vector and parallel computers, this method is rewritten in block versions and BLAS (Basic Linear Algebra Subprograms) kernels are used to mask the architectural details and allow good performance of codes such as the LAPACK (Linear Algebra PACKage) library. In the references it was proved that this strategy leads to portability and efficiency of codes using tuned BLAS kernels. After a short description of the block versions we will present some results obtained on the VAX 6520/2VP, comparing the block algorithm versus point algorithm, and vectorized versions versus scalar versions. The three columnwise versions of the block algorithm showed similar performance for this computer and large matrix dimensions. The block size used is a crucial parameter for these algorithms and the results show that the best performance is obtained with block size 64 (for large matrices) which is the vector registered size of the machine used.</p></div>","PeriodicalId":100325,"journal":{"name":"Computing Systems in Engineering","volume":"6 4","pages":"Pages 423-429"},"PeriodicalIF":0.0000,"publicationDate":"1995-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1016/0956-0521(95)00049-6","citationCount":"0","resultStr":"{\"title\":\"Columnwise block LU factorization using blas kernels on VAX 6520/2VP\",\"authors\":\"Paulo B. Vasconcelos ,&nbsp;Filomena D. D'Almeida\",\"doi\":\"10.1016/0956-0521(95)00049-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>The LU factorization of a matrix <em>A</em> is a widely used algorithm, for instance in the solution of linear systems <em>Ax</em> = <em>b</em>. The increasing capacities of high performance computers allow us to use direct methods for systems of large and dense matrices. To build portable and efficient LU codes for vector and parallel computers, this method is rewritten in block versions and BLAS (Basic Linear Algebra Subprograms) kernels are used to mask the architectural details and allow good performance of codes such as the LAPACK (Linear Algebra PACKage) library. In the references it was proved that this strategy leads to portability and efficiency of codes using tuned BLAS kernels. After a short description of the block versions we will present some results obtained on the VAX 6520/2VP, comparing the block algorithm versus point algorithm, and vectorized versions versus scalar versions. The three columnwise versions of the block algorithm showed similar performance for this computer and large matrix dimensions. 
The block size used is a crucial parameter for these algorithms and the results show that the best performance is obtained with block size 64 (for large matrices) which is the vector registered size of the machine used.</p></div>\",\"PeriodicalId\":100325,\"journal\":{\"name\":\"Computing Systems in Engineering\",\"volume\":\"6 4\",\"pages\":\"Pages 423-429\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1995-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1016/0956-0521(95)00049-6\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computing Systems in Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/0956052195000496\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computing Systems in Engineering","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/0956052195000496","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 0

Abstract


The LU factorization of a matrix A is a widely used algorithm, for instance in the solution of linear systems Ax = b. The increasing capacity of high-performance computers allows us to use direct methods for systems with large, dense matrices. To build portable and efficient LU codes for vector and parallel computers, the method is rewritten in block versions, and BLAS (Basic Linear Algebra Subprograms) kernels are used to hide the architectural details and obtain good performance, as in the LAPACK (Linear Algebra PACKage) library. The references show that this strategy yields portable and efficient codes when tuned BLAS kernels are available. After a short description of the block versions, we present results obtained on the VAX 6520/2VP, comparing the block algorithm with the point algorithm and the vectorized versions with the scalar versions. The three columnwise versions of the block algorithm showed similar performance on this computer for large matrix dimensions. The block size is a crucial parameter for these algorithms, and the results show that, for large matrices, the best performance is obtained with block size 64, which matches the vector register size of the machine used.
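A minimal sketch of the blocked scheme described above, assuming a right-looking columnwise variant without pivoting. It is not the authors' Fortran code: Python with NumPy/SciPy stands in for the BLAS/LAPACK kernels (solve_triangular plays the role of a TRSM kernel, the matrix product that of GEMM), and the function name block_lu_nopivot and the default nb=64 are illustrative choices; nb=64 simply mirrors the block size the abstract reports as best on the VAX 6520/2VP.

import numpy as np
from scipy.linalg import solve_triangular

def block_lu_nopivot(A, nb=64):
    """Right-looking block LU (no pivoting): overwrites a copy of A with
    L (unit lower triangle, diagonal implicit) and U (upper triangle).
    Assumes nonzero pivots; production codes such as LAPACK's GETRF add
    partial pivoting."""
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # 1. Unblocked LU of the tall panel A[k:n, k:k+b] (BLAS-2 work).
        for j in range(k, k + b):
            A[j+1:, j] /= A[j, j]  # column of L
            A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])
        if k + b < n:
            # 2. U12 = L11^{-1} A12: triangular solve (TRSM-like, BLAS-3).
            A[k:k+b, k+b:] = solve_triangular(
                A[k:k+b, k:k+b], A[k:k+b, k+b:],
                lower=True, unit_diagonal=True)
            # 3. Trailing update A22 -= L21 U12 (GEMM-like, BLAS-3).
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A  # L and U packed in a single array

# Quick consistency check on a diagonally dominant matrix (safe without pivoting).
rng = np.random.default_rng(0)
M = rng.standard_normal((256, 256)) + 256 * np.eye(256)
F = block_lu_nopivot(M, nb=64)
L = np.tril(F, -1) + np.eye(256)
U = np.triu(F)
assert np.allclose(L @ U, M)

The block size governs the balance between the BLAS-2 panel work and the BLAS-3 triangular-solve and update work; the abstract attributes the optimum of 64 on this machine to its vector register length.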
