Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract)

Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing. Pub Date: 2023-07-18. DOI: 10.1145/3597635.3598031

Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher


Citations: 1

Abstract



Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques for ameliorating memory bottlenecks, such as tiling. Prior sparse tiling schemes are sparsity-unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).
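The contrast the abstract draws between uniform coordinate-space tiles and occupancy-driven tiles can be illustrated with a toy sketch. This is not the DRT algorithm from the paper. It assumes a single operand tiled along its rows only, a hypothetical buffer `capacity` counted in nonzeros, and a static scheme that conservatively picks the largest uniform tile height whose densest tile still fits. All function names and the sample `row_nnz` data are made up for illustration.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # half-open row range [start, end)


def uniform_tiles(num_rows: int, rows_per_tile: int) -> List[Tile]:
    """Sparsity-unaware tiling: every tile spans the same number of rows."""
    return [(s, min(s + rows_per_tile, num_rows))
            for s in range(0, num_rows, rows_per_tile)]


def largest_fitting_uniform(row_nnz: List[int], capacity: int) -> List[Tile]:
    """Assumed static policy: the tallest uniform tile whose densest
    instance still fits in the buffer (falls back to one row per tile)."""
    for rows_per_tile in range(len(row_nnz), 0, -1):
        tiles = uniform_tiles(len(row_nnz), rows_per_tile)
        if all(sum(row_nnz[s:e]) <= capacity for s, e in tiles):
            return tiles
    return uniform_tiles(len(row_nnz), 1)


def occupancy_tiles(row_nnz: List[int], capacity: int) -> List[Tile]:
    """Sparsity-aware tiling: grow each tile until the next row would
    push it past `capacity` nonzeros."""
    tiles: List[Tile] = []
    start, filled = 0, 0
    for row, nnz in enumerate(row_nnz):
        # Close the current tile if this row would overflow it. A single
        # row larger than `capacity` still gets a tile of its own.
        if row > start and filled + nnz > capacity:
            tiles.append((start, row))
            start, filled = row, 0
        filled += nnz
    if start < len(row_nnz):
        tiles.append((start, len(row_nnz)))
    return tiles


def mean_occupancy(row_nnz: List[int], tiles: List[Tile], capacity: int) -> float:
    """Average fraction of the buffer that each tile actually fills."""
    return sum(sum(row_nnz[s:e]) for s, e in tiles) / (capacity * len(tiles))


# Made-up nonzero counts per row: two dense rows in a mostly empty matrix.
row_nnz = [1, 0, 0, 2, 8, 7, 0, 1, 0, 0, 1, 0]
capacity = 8

static = largest_fitting_uniform(row_nnz, capacity)
dynamic = occupancy_tiles(row_nnz, capacity)

print(len(static), mean_occupancy(row_nnz, static, capacity))
print(dynamic, mean_occupancy(row_nnz, dynamic, capacity))
```

In this toy input the two dense rows force the uniform scheme down to one row per tile, so most of its tiles are nearly empty, while the occupancy-driven scheme covers the same rows with far fewer and fuller tiles. The sketch leaves out the parts of DRT that the abstract describes but does not detail: re-tiling at runtime, accounting for the active regions of all input tensors, and keeping tiles of distinct tensors co-iterable.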
