Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Pub Date: 2022-05-01. DOI: 10.1109/ispass55109.2022.00039

Mark Horeni, Pooria Taheri, Po-An Tsai, A. Parashar, J. Emer, S. Joshi


Citations: 3

Abstract


Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization

Finding high-quality mappings of Deep Neural Network (DNN) models onto tensor accelerators is critical for efficiency. State-of-the-art mapping exploration tools use remainderless (i.e., perfect) factorization to allocate hardware resources: they tile the tensors using factors of the tensor dimensions. This limits the size of the search space (i.e., the mapspace), but can lead to low resource utilization. We introduce a new mapspace, Ruby, that adds remainders (i.e., imperfect factorization) to expand the mapspace with high-quality mappings for user-defined architectures. This expansion allows us to allocate resources more precisely by generating tile sizes that better conform to hardware resources. However, this mapspace expansion also increases the number of unique mappings. Consequently, this paper studies the trade-off between Ruby’s mapspace expansion and mapping quality. We propose Ruby-S (Spatial), which employs imperfect factorization only to improve parallelism. Ruby-S incurs a moderate mapspace expansion while reducing energy-delay product (EDP) by up to 50%, with an average improvement of 20%, when implementing ResNet-50 on an Eyeriss-like architecture. For the most part, this improvement can be attributed to higher compute utilization. EDP on a Simba-like architecture improves by up to 40%, with an average of 10%. For DeepBench workloads, Ruby-S yields improvements of up to 45%, with an average improvement of 10%, on an Eyeriss-like architecture. Ruby-S is robust to accelerator configurations: when implementing ResNet-50 on different accelerator configurations, it improves EDP by 20% on average, with a maximum improvement of 55%. Ruby-S mappings form a new Pareto frontier, improving the performance of previous configurations by an average of 30% and 20% for ResNet-50 and DeepBench workloads, respectively.
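The contrast between perfect and imperfect factorization can be illustrated with a small sketch. This is not the paper's mapper or cost model: the function names, the single tensor dimension of size 56, and the row of 12 processing elements (PEs) are all assumptions chosen for illustration. It compares the largest spatial factor that divides the dimension evenly with a spatial factor that is allowed to leave a remainder in the last step.

```python
from math import ceil


def perfect_spatial_factor(dim: int, num_pes: int) -> int:
    """Largest divisor of `dim` that fits on `num_pes` PEs (remainderless tiling)."""
    return max(f for f in range(1, num_pes + 1) if dim % f == 0)


def imperfect_spatial_factor(dim: int, num_pes: int) -> int:
    """Assumed policy: use as many PEs as possible; the last step may be partial."""
    return min(dim, num_pes)


def steps_and_utilization(dim: int, spatial_factor: int, num_pes: int):
    """Temporal steps needed, and average fraction of PEs doing useful work."""
    steps = ceil(dim / spatial_factor)
    utilization = dim / (steps * num_pes)
    return steps, utilization


if __name__ == "__main__":
    DIM, NUM_PES = 56, 12  # hypothetical values, not taken from the paper
    for name, pick in [("perfect", perfect_spatial_factor),
                       ("imperfect", imperfect_spatial_factor)]:
        factor = pick(DIM, NUM_PES)
        steps, util = steps_and_utilization(DIM, factor, NUM_PES)
        print(f"{name}: factor={factor} steps={steps} utilization={util:.2f}")
```

With these made-up numbers, the perfect factor is 8 (7 steps, roughly 67% average utilization), whereas a spatial factor of 12 with a remainder needs 5 steps (roughly 93%). Allowing every factor from 1 to 12 rather than only the divisors of 56 is also what enlarges the mapspace in this toy setting.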
