Cache-efficient implementation and batching of tridiagonalization on manycore CPUs
Shuhei Kudo, Toshiyuki Imamura
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 14 January 2019. DOI: 10.1145/3293320.3293329
We propose an efficient implementation of tridiagonalization (TRD) for small matrices on manycore CPUs. Tridiagonalization is a matrix decomposition used as a preprocessing step in eigenvalue computations, and TRD of such small matrices arises even in HPC environments as a subproblem of larger computations. To exploit the large cache memory of recent manycore CPUs, we reconstructed all parts of the implementation around a systematic code generator, which provides performance portability and future extensibility. The flexibility of this system allows us to incorporate the "BLAS+X" approach, improving the data reusability of the TRD algorithm and enabling batching. Performance results indicate that our system outperforms library implementations of TRD by nearly a factor of two (or more for small matrices) on three different manycore CPUs: Fujitsu SPARC64, Intel Xeon, and Xeon Phi. As an extension, we also implemented batched execution of TRD with a cache-aware scheduler on top of our system. Batching not only doubles peak performance for small matrices of n = O(100), but also improves it significantly up to n = O(1,000), our target size.
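As context for why TRD serves as a preprocessing step: a dense symmetric eigenproblem is typically reduced to tridiagonal form by a sequence of Householder reflectors before a tridiagonal eigensolver runs, since the reduction preserves the eigenvalues. The unblocked textbook reduction can be sketched in NumPy as follows; this is only an illustration of the decomposition itself, not the authors' cache-efficient or batched implementation.

```python
import numpy as np

def tridiagonalize(A):
    """Reduce a real symmetric matrix to tridiagonal form T = Q^T A Q
    using Householder reflectors (unblocked textbook algorithm)."""
    T = np.array(A, dtype=float)
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k + 1:, k].copy()
        nx = np.linalg.norm(x)
        if nx == 0.0:
            continue  # column already reduced
        # Choose the sign of alpha to avoid cancellation in v[0]
        alpha = -nx if x[0] >= 0 else nx
        v = x
        v[0] -= alpha
        vn = np.linalg.norm(v)
        if vn == 0.0:
            continue
        v /= vn
        # Apply H = I - 2 v v^T from the left and the right
        T[k + 1:, :] -= 2.0 * np.outer(v, v @ T[k + 1:, :])
        T[:, k + 1:] -= 2.0 * np.outer(T[:, k + 1:] @ v, v)
    return T
```

The eigenvalues of the resulting tridiagonal matrix match those of the input, which is what makes TRD a valid preprocessing step for a symmetric eigensolver.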