{"title":"Implementing Sparse Linear Algebra Kernels on the Lucata Pathfinder-A Computer","authors":"Géraud Krawezik, Shannon K. Kuntz, P. Kogge","doi":"10.1109/HPEC43674.2020.9286207","DOIUrl":null,"url":null,"abstract":"We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is the Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen as they account for the largest run time of the HPCG benchmark. We introduce the system used for the experiments, as well as its programming model and key aspects to get the most performance from it. We describe the data distribution used to allow an efficient parallelization of the algorithms, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show an almost linear strong scaling with the code, and discuss future work and improvements.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC43674.2020.9286207","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen because they account for the largest share of the run time of the HPCG benchmark. We introduce the system used for the experiments, along with its programming model and the key aspects of extracting the most performance from it. We describe the data distribution used to parallelize the algorithms efficiently, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show almost linear strong scaling for the code, and discuss future work and improvements.
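For reference, the two kernels named in the abstract are standard HPCG building blocks. Below is a minimal serial C sketch of both over a Compressed Sparse Row (CSR) matrix; this is a textbook rendering for orientation only, not the paper's migratory-memory parallel implementation, and the `csr_t` layout and function names are illustrative assumptions rather than anything from the paper.

```c
#include <stddef.h>

/* Illustrative CSR layout (assumption, not the paper's data structure):
 * row_ptr has n+1 entries; col_idx/vals hold each row's nonzeros. */
typedef struct {
    size_t n;              /* number of rows (square matrix assumed) */
    const size_t *row_ptr;
    const size_t *col_idx;
    const double *vals;
} csr_t;

/* SpMV: y = A * x */
void spmv(const csr_t *A, const double *x, double *y) {
    for (size_t i = 0; i < A->n; ++i) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += A->vals[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

/* One Gauss-Seidel relaxation of row i of A x = b.
 * Assumes every row has a nonzero diagonal entry. */
static void gs_row_update(const csr_t *A, const double *b,
                          double *x, size_t i) {
    double sum = b[i], diag = 0.0;
    for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
        size_t j = A->col_idx[k];
        if (j == i) diag = A->vals[k];
        else        sum -= A->vals[k] * x[j];
    }
    x[i] = sum / diag;
}

/* SymGS: a forward sweep over the rows followed by a backward
 * sweep, as in the HPCG reference kernel. */
void symgs(const csr_t *A, const double *b, double *x) {
    for (size_t i = 0; i < A->n; ++i)   /* forward sweep */
        gs_row_update(A, b, x, i);
    for (size_t i = A->n; i-- > 0; )    /* backward sweep */
        gs_row_update(A, b, x, i);
}
```

The serial SymGS sweep carries a loop-carried dependence (each row update reads updated neighbors), which is why parallelizing it, unlike SpMV, requires a careful data distribution such as the one the paper describes.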