Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher
{"title":"利用动态自反平铺加速稀疏数据编排(扩展摘要)","authors":"Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher","doi":"10.1145/3597635.3598031","DOIUrl":null,"url":null,"abstract":"Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques---such as tiling---for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"95 Suppl A 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract)\",\"authors\":\"Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher\",\"doi\":\"10.1145/3597635.3598031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques---such as tiling---for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).\",\"PeriodicalId\":185981,\"journal\":{\"name\":\"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing\",\"volume\":\"95 Suppl A 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3597635.3598031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3597635.3598031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract)
Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques---such as tiling---for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).