Title: Linear Algebra for Dense Matrices on a Hypercube
Author: M. P. Sears
DOI: 10.1109/DMCC.1990.555400
Published in: Proceedings of the Fifth Distributed Memory Computing Conference, 1990
Publication date: 1990-04-08
Abstract: A set of routines has been written for dense matrix operations, optimized for the NCUBE/6400 parallel processor. This work was motivated by a Sandia effort to parallelize certain electronic structure calculations [1]. Routines are included for matrix transpose, multiply, Cholesky decomposition, triangular inversion, and Householder tridiagonalization. The library is written in C and is callable from Fortran. Matrices up to order 1600 can be handled on 128 processors. For each operation, the algorithm used is presented along with typical timings and estimates of performance. Performance for order 1600 on 128 processors varies from 42 MFLOPs (Householder tridiagonalization, triangular inverse) up to 126 MFLOPs (matrix multiply). We also present performance results for communications and basic linear algebra operations (saxpy and dot products).
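The saxpy and dot-product kernels the abstract benchmarks are the standard BLAS level-1 operations. The paper's NCUBE/6400 versions distribute vectors across hypercube nodes; the serial sketch below (in C, the library's own language) only illustrates what the two kernels compute, not the parallel implementation described in the paper:

```c
#include <stddef.h>

/* saxpy: y <- a*x + y, the basic vector update (serial sketch;
   the library's version operates on node-local vector segments). */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* dot: inner product of x and y (serial sketch; the parallel
   version would combine per-node partial sums across the cube). */
float dot(size_t n, const float *x, const float *y) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```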