EGEMM-TC:扩展精度的张量核加速科学计算

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI:10.1145/3437801.3441599

Boyuan Feng, Yuke Wang, Guoyang Chen, Weifeng Zhang, Yuan Xie, Yufei Ding

{"title":"EGEMM-TC:扩展精度的张量核加速科学计算","authors":"Boyuan Feng, Yuke Wang, Guoyang Chen, Weifeng Zhang, Yuan Xie, Yufei Ding","doi":"10.1145/3437801.3441599","DOIUrl":null,"url":null,"abstract":"Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision requirements. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC) to extend the usage of Tensor Cores to accelerate scientific computing applications without compromising the precision requirements. First, EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight emulation algorithm on Tensor Cores with extended-precision. Second, EGEMM-TC exploits a set of Tensor Core kernel optimizations to achieve high performance, including the highly-efficient tensorization to exploit the Tensor Core memory architecture and the instruction-level optimizations to coordinate the emulation computation and memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model to offer large flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC can generalize the usage of Tensor Cores and achieve about 1.8× speedup compared to the hand-tuned, highly-optimized implementations running on CUDA Cores.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"150 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"27","resultStr":"{\"title\":\"EGEMM-TC: accelerating scientific computing on tensor cores with extended precision\",\"authors\":\"Boyuan Feng, Yuke Wang, Guoyang Chen, Weifeng Zhang, Yuan Xie, Yufei Ding\",\"doi\":\"10.1145/3437801.3441599\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision requirements. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC) to extend the usage of Tensor Cores to accelerate scientific computing applications without compromising the precision requirements. First, EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight emulation algorithm on Tensor Cores with extended-precision. Second, EGEMM-TC exploits a set of Tensor Core kernel optimizations to achieve high performance, including the highly-efficient tensorization to exploit the Tensor Core memory architecture and the instruction-level optimizations to coordinate the emulation computation and memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model to offer large flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC can generalize the usage of Tensor Cores and achieve about 1.8× speedup compared to the hand-tuned, highly-optimized implementations running on CUDA Cores.\",\"PeriodicalId\":124852,\"journal\":{\"name\":\"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming\",\"volume\":\"150 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"27\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3437801.3441599\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3437801.3441599","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 27

摘要

Nvidia Tensor Cores通过针对深度学习工作负载量身定制的半精度矩阵输入实现高性能。然而，这限制了张量核的应用，特别是在对精度要求很高的科学计算领域。在本文中，我们在张量核上构建仿真GEMM (EGEMM-TC)，以扩展张量核的使用，在不影响精度要求的情况下加速科学计算应用。首先，EGEMM-TC采用可扩展的硬件分析和操作设计工作流，在张量核上生成具有扩展精度的轻量级仿真算法。其次，EGEMM-TC利用一组Tensor Core内核优化来实现高性能，包括利用Tensor Core内存架构的高效张化和协调仿真计算和内存访问的指令级优化。第三，EGEMM-TC结合了一个硬件感知的分析模型，为跨各种科学计算工作负载和输入数据集的自动性能调优提供了很大的灵活性。广泛的评估表明，EGEMM-TC在CUDA内核上的cuBLAS内核和CUDA- sdk内核分别可以实现3.13倍和11.18倍的平均加速。我们对几个科学计算应用的案例研究进一步证实，EGEMM-TC可以推广张量核心的使用，与在CUDA核心上运行的手动调整、高度优化的实现相比，可以实现大约1.8倍的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

EGEMM-TC: accelerating scientific computing on tensor cores with extended precision

Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision requirements. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC) to extend the usage of Tensor Cores to accelerate scientific computing applications without compromising the precision requirements. First, EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight emulation algorithm on Tensor Cores with extended-precision. Second, EGEMM-TC exploits a set of Tensor Core kernel optimizations to achieve high performance, including the highly-efficient tensorization to exploit the Tensor Core memory architecture and the instruction-level optimizations to coordinate the emulation computation and memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model to offer large flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC can generalize the usage of Tensor Cores and achieve about 1.8× speedup compared to the hand-tuned, highly-optimized implementations running on CUDA Cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

自引率

0.00%

发文量