在最新的神威处理器上自动生成高性能矩阵乘法核

Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI:10.1145/3545008.3545031

Xiaohan Tao, Yuming Zhu, Bo-Hsuan Wang, Jinlong Xu, Jianmin Pang, Jie Zhao

{"title":"在最新的神威处理器上自动生成高性能矩阵乘法核","authors":"Xiaohan Tao, Yuming Zhu, Bo-Hsuan Wang, Jinlong Xu, Jianmin Pang, Jie Zhao","doi":"10.1145/3545008.3545031","DOIUrl":null,"url":null,"abstract":"We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The method allows users to write simple C code and automatically generates high-performance matrix multiplication kernels. It uses polyhedral transformations to implement rapid compute decomposition, data exchanges across memory hierarchy and memory latency hiding. An assembly routine is finally integrated into the generated kernels. While achieving up to 90.14% of the theoretical peak performance, our method surpasses a highly tuned library by 9.44%. Compared with existing techniques, our approach reduces the software development life cycle to generate efficient matrix code from months to seconds. We also take into account batched matrix multiplication and some fusion patterns for deep learning (DL), outperforming the library-based implementations by 1.30 × and 1.67 ×.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor\",\"authors\":\"Xiaohan Tao, Yuming Zhu, Bo-Hsuan Wang, Jinlong Xu, Jianmin Pang, Jie Zhao\",\"doi\":\"10.1145/3545008.3545031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The method allows users to write simple C code and automatically generates high-performance matrix multiplication kernels. It uses polyhedral transformations to implement rapid compute decomposition, data exchanges across memory hierarchy and memory latency hiding. An assembly routine is finally integrated into the generated kernels. While achieving up to 90.14% of the theoretical peak performance, our method surpasses a highly tuned library by 9.44%. Compared with existing techniques, our approach reduces the software development life cycle to generate efficient matrix code from months to seconds. We also take into account batched matrix multiplication and some fusion patterns for deep learning (DL), outperforming the library-based implementations by 1.30 × and 1.67 ×.\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

我们提出了一种在最新的神威处理器上自动生成高效矩阵乘法代码的方法，该处理器将被用于地球上最快的超级计算机之一神威太湖之光的下一代机器。该方法允许用户编写简单的C代码并自动生成高性能的矩阵乘法内核。它使用多面体变换实现快速计算分解、跨内存层次的数据交换和内存延迟隐藏。最后将一个汇编例程集成到生成的内核中。在达到理论峰值性能的90.14%的同时，我们的方法比高度调优的库高出9.44%。与现有技术相比，我们的方法将生成高效矩阵代码的软件开发生命周期从几个月缩短到几秒钟。我们还考虑了批处理矩阵乘法和深度学习(DL)的一些融合模式，比基于库的实现性能高出1.30倍和1.67倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor

We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The method allows users to write simple C code and automatically generates high-performance matrix multiplication kernels. It uses polyhedral transformations to implement rapid compute decomposition, data exchanges across memory hierarchy and memory latency hiding. An assembly routine is finally integrated into the generated kernels. While achieving up to 90.14% of the theoretical peak performance, our method surpasses a highly tuned library by 9.44%. Compared with existing techniques, our approach reduces the software development life cycle to generate efficient matrix code from months to seconds. We also take into account batched matrix multiplication and some fusion patterns for deep learning (DL), outperforming the library-based implementations by 1.30 × and 1.67 ×.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 51st International Conference on Parallel Processing

自引率

0.00%

发文量