SIMD2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Yunan Zhang, Po-An Tsai, Hung-Wei Tseng
Published in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022-05-03
DOI: 10.1145/3470496.3527411 (https://doi.org/10.1145/3470496.3527411)
Citations: 6

Abstract

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup and more than 6.94× on average over optimized CUDA programs, with only 5% of full-chip area overhead.
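
To make the "same structure, different core operation" idea concrete, here is a minimal scalar sketch in Python. It is not the SIMD2 instruction set or the authors' emulation framework; the names `semiring_matmul`, `multiply_add`, and `min_plus` are hypothetical and chosen only for illustration. The loop nest is that of an ordinary matrix multiplication, and only the two scalar operators and the identity value change. The `min_plus` variant corresponds to what the abstract calls add-minimum: add along a path, then keep the minimum.

```python
from typing import Callable, List

Matrix = List[List[float]]
INF = float("inf")


def semiring_matmul(
    a: Matrix,
    b: Matrix,
    combine: Callable[[float, float], float],
    reduce: Callable[[float, float], float],
    identity: float,
) -> Matrix:
    """Compute out[i][j] = reduce over k of combine(a[i][k], b[k][j]).

    Hypothetical helper: the loop structure is identical to GEMM,
    only `combine`, `reduce`, and `identity` vary per operation.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[identity] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = identity
            for k in range(inner):
                acc = reduce(acc, combine(a[i][k], b[k][j]))
            out[i][j] = acc
    return out


def multiply_add(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix multiplication: combine = *, reduce = +, identity = 0."""
    return semiring_matmul(a, b, lambda x, y: x * y, lambda x, y: x + y, 0.0)


def min_plus(a: Matrix, b: Matrix) -> Matrix:
    """Add-minimum variant: combine = +, reduce = min, identity = +inf."""
    return semiring_matmul(a, b, lambda x, y: x + y, min, INF)
```

As a usage illustration, if `graph[i][j]` holds the weight of the edge from node `i` to node `j` (with `INF` for a missing edge and 0 on the diagonal), then `min_plus(graph, graph)` gives the shortest distances over paths of at most two edges. That is the kind of non-GEMM workload with a semiring-like structure the abstract refers to.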