Optimizing PLASMA Eigensolver on Large Shared Memory Systems
Cheng Liao
2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2016-11-13
DOI: 10.1109/SCALA.2016.14 (https://doi.org/10.1109/SCALA.2016.14)
Citations: 0
Abstract
The PLASMA dense symmetric eigensolver is optimized for large shared-memory systems by using multiple Householder domains for the dense-to-band reduction and a communication-reducing kernel for bulge chasing. The mr3-smp code by Petschow and Bientinesi is used for the tridiagonal eigensolution, and the eigenvector back-transformations employ a 1D parallel decomposition. The input matrix and the Householder vectors and scalars are distributed among the CPU sockets with interleaved memory pages, whereas the banded matrix, the eigenvectors, and temporary memory buffers are allocated and processed locally. Other considerations and optimization techniques are also presented. Numerical examples show that, when all eigenpairs are computed and the problem size is sufficiently large, the PLASMA eigensolver can significantly outperform ELPA and EIGENEXA, and that the 2-stage eigensolution is generally better than its 1-stage counterpart on the latest x86_64 EP-4S CPUs with AVX2.
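As a rough illustration of why a 1D decomposition suits the eigenvector back-transformation, the sketch below splits the eigenvector columns into contiguous blocks and applies the same sequence of Householder reflectors to each block independently. It is not PLASMA code: the function names (`column_blocks`, `apply_reflectors`, `back_transform`) and the list-of-columns layout are assumptions made for this example, and Python threads are used only to show the independence of the blocks, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def column_blocks(n_cols, n_workers):
    """Split range(n_cols) into contiguous, nearly equal [start, stop) blocks."""
    base, extra = divmod(n_cols, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        if stop > start:  # skip empty blocks when n_workers > n_cols
            blocks.append((start, stop))
        start = stop
    return blocks


def apply_reflectors(columns, reflectors):
    """Apply H_1 H_2 ... H_k to each column, with H_i = I - tau_i * v_i * v_i^T.

    `reflectors` is a list of (v, tau) pairs (hypothetical layout).
    """
    out = []
    for col in columns:
        x = list(col)
        # The rightmost factor acts first, so walk the reflectors in reverse.
        for v, tau in reversed(reflectors):
            dot = sum(vi * xi for vi, xi in zip(v, x))
            x = [xi - tau * dot * vi for vi, xi in zip(v, x)]
        out.append(x)
    return out


def back_transform(eigvec_columns, reflectors, n_workers=4):
    """Back-transform eigenvectors block by block (1D column decomposition)."""
    blocks = column_blocks(len(eigvec_columns), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each block reads the shared reflectors and writes only its own columns.
        parts = pool.map(
            lambda b: apply_reflectors(eigvec_columns[b[0]:b[1]], reflectors),
            blocks,
        )
        return [col for part in parts for col in part]
```

Because every column is transformed independently, the workers exchange no data with one another. The reflectors are read by all workers while each block of eigenvectors is touched by only one, which loosely mirrors the split described in the abstract between data shared across sockets and data processed locally.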