Distributed O(N) Linear Solver for Dense Symmetric Hierarchical Semi-Separable Matrices
Authors: Chenhan D. Yu, Severin Reiz, G. Biros
Published in: 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2019-10-01
DOI: 10.1109/MCSoC.2019.00008 (https://doi.org/10.1109/MCSoC.2019.00008)
Citations: 6
Abstract
We present a distributed-memory algorithm for the approximate hierarchical factorization of symmetric positive definite (SPD) matrices. Our method is based on the distributed-memory GOFMM, an algorithm that appeared at SC18 (doi:10.1109/SC.2018.00018). GOFMM constructs a hierarchical approximation of an arbitrary SPD matrix, compressing it by creating low-rank approximations of the off-diagonal blocks. GOFMM offers no guarantee of success for arbitrary SPD matrices. (The situation is similar to the SVD: not every matrix admits a good low-rank approximation.) For many SPD matrices, however, GOFMM achieves compression that enables fast matrix-vector multiplication, which can reach N log N time, as opposed to the N^2 required for a dense matrix. GOFMM supports shared- and distributed-memory parallelism. In this paper, we build an approximate "ULV" factorization based on the Hierarchically Semi-Separable (HSS) compression of GOFMM. This factorization requires O(N) work (given the compressed matrix) and O(N/p) + O(log p) time on p MPI processes (assuming a hypercube topology). The previous state of the art required O(N log N) work. We present the factorization algorithm, discuss its complexity, and report weak and strong scaling results for the "factorization" and "solve" phases of our algorithm. We also discuss the performance of the inexact ULV factorization as a preconditioner for a few exemplary large dense linear systems. In our largest run, we factorized a 67M-by-67M matrix in less than one second and solved a system with 64 right-hand sides in less than one-tenth of a second. This run used 6,144 Intel "Skylake" cores on the SKX partition of the Stampede2 system at the Texas Advanced Computing Center.
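
To make the idea of compressing off-diagonal blocks concrete, the following is a minimal one-level sketch in plain Python. It is not GOFMM and not the HSS/ULV algorithm of the paper: it assumes a symmetric 2-by-2 block matrix whose off-diagonal block is exactly rank r, given as hypothetical factors `U` and `V`, and only shows why a matrix-vector product through those factors is cheaper than the dense product. All function names here are illustrative assumptions.

```python
import random


def matvec(M, x):
    """Dense product of a list-of-rows matrix M with vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def transpose(M):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


def compressed_matvec(D1, D2, U, V, x):
    """Compute y = A x for A = [[D1, U V^T], [V U^T, D2]] without forming U V^T.

    D1, D2 are n-by-n dense diagonal blocks; U, V are n-by-r factors.
    Each off-diagonal product costs about 2*n*r multiplies instead of n*n.
    """
    n = len(D1)
    x1, x2 = x[:n], x[n:]
    ut_x1 = matvec(transpose(U), x1)  # r-vector U^T x1
    vt_x2 = matvec(transpose(V), x2)  # r-vector V^T x2
    y1 = [a + b for a, b in zip(matvec(D1, x1), matvec(U, vt_x2))]
    y2 = [a + b for a, b in zip(matvec(D2, x2), matvec(V, ut_x1))]
    return y1 + y2


def assemble_dense(D1, D2, U, V):
    """Form the full 2n-by-2n matrix explicitly, for checking only."""
    n = len(D1)
    A12 = [[sum(u * v for u, v in zip(U[i], V[j])) for j in range(n)]
           for i in range(n)]
    A21 = transpose(A12)
    return [D1[i] + A12[i] for i in range(n)] + [A21[i] + D2[i] for i in range(n)]


def max_error(n=8, r=2, seed=0):
    """Largest entrywise gap between the dense and compressed products."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    def symmetrize(M):
        return [[0.5 * (M[i][j] + M[j][i]) for j in range(len(M))]
                for i in range(len(M))]

    D1, D2 = symmetrize(rand(n, n)), symmetrize(rand(n, n))
    U, V = rand(n, r), rand(n, r)
    x = [rng.uniform(-1.0, 1.0) for _ in range(2 * n)]
    dense = matvec(assemble_dense(D1, D2, U, V), x)
    fast = compressed_matvec(D1, D2, U, V, x)
    return max(abs(a - b) for a, b in zip(dense, fast))
```

Calling `max_error()` compares the two products on a small random instance; the results should agree up to floating-point rounding. In a hierarchical scheme the same splitting would be applied recursively to `D1` and `D2`, which is where the near-linear cost comes from, but that recursion and the nested bases of HSS are beyond this sketch.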