DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity

IF 1.4 · CAS Zone 3 (Computer Science) · JCR Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos
{"title":"DeMM:支持松弛结构稀疏性的解耦矩阵乘法引擎","authors":"Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos","doi":"10.1109/LCA.2024.3355178","DOIUrl":null,"url":null,"abstract":"Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of \n<inline-formula><tex-math>$N$</tex-math></inline-formula>\n:128, or \n<inline-formula><tex-math>$N$</tex-math></inline-formula>\n:256, for small values of \n<inline-formula><tex-math>$N$</tex-math></inline-formula>\n, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and \n<inline-formula><tex-math>$N$</tex-math></inline-formula>\n read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 1","pages":"17-20"},"PeriodicalIF":1.4000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity\",\"authors\":\"Christodoulos Peltekis;Vasileios Titopoulos;Chrysostomos Nicopoulos;Giorgos Dimitrakopoulos\",\"doi\":\"10.1109/LCA.2024.3355178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of \\n<inline-formula><tex-math>$N$</tex-math></inline-formula>\\n:128, or \\n<inline-formula><tex-math>$N$</tex-math></inline-formula>\\n:256, for small values of \\n<inline-formula><tex-math>$N$</tex-math></inline-formula>\\n, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. 
In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and \\n<inline-formula><tex-math>$N$</tex-math></inline-formula>\\n read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.\",\"PeriodicalId\":51248,\"journal\":{\"name\":\"IEEE Computer Architecture Letters\",\"volume\":\"23 1\",\"pages\":\"17-20\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2024-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Computer Architecture Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10402073/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10402073/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Deep Learning (DL) has achieved unprecedented success in various application domains. Meanwhile, model pruning has emerged as a viable solution to reduce the footprint of DL models in mobile applications, without compromising their accuracy. To enable the matrix engines built for dense DL models to also handle their pruned counterparts, pruned DL models follow a fine-grained structured sparsity pattern of 1:4, or 2:4, whereby in each group of four contiguous values, at least one, or two, respectively, must be non-zero. Structured sparsity has recently also moved to coarser (relaxed) cases of $N$:128, or $N$:256, for small values of $N$, targeting a wider range of sparsity (10%-90%) for the DL models. In this work, we design an accelerator that operates, by construction, on wide blocks with relaxed structured sparsity. In contrast to the conventional systolic array archetype, the new engine decouples the memory part of the systolic array from the multiply-add units. The memory block comprises 1 write and $N$ read ports, with the number of read ports being equal to the number of non-zero elements per row. The multiply-add units connect directly to each read port and complete the multiplication in a row-wise product-first order. More importantly, simple reconfiguration facilitates more dense patterns. The experimental evaluation demonstrates substantial latency improvements over current state-of-the-art systolic array engines built for fine-grained and relaxed structured sparsity.
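
To make the abstract's two key ideas concrete, here is a minimal NumPy sketch of relaxed $N$:$M$ structured sparsity and the row-wise, product-first dataflow. This is an illustrative software model only, not the authors' hardware design: the function names, the top-magnitude pruning rule, and the default block width of 128 are assumptions made for the example.

```python
# Illustrative software model of relaxed N:M structured sparsity and the
# row-wise, product-first dataflow described in the abstract. A sketch,
# not the DeMM hardware: names and the pruning rule are assumed.
import numpy as np

def compress_relaxed(row, n, m=128):
    """Prune a row to relaxed n:m sparsity: in each block of m contiguous
    values, keep only the n largest-magnitude entries, recording each kept
    value together with its column index (the stored non-zeros)."""
    vals, cols = [], []
    for start in range(0, len(row), m):
        block = row[start:start + m]
        keep = np.sort(np.argsort(-np.abs(block))[:n])  # in-block positions kept
        vals.append(block[keep])
        cols.append(keep + start)                       # absolute column indices
    return np.concatenate(vals), np.concatenate(cols)

def demm_style_matmul(A, B, n, m=128):
    """Row-wise product-first multiply of a relaxed-sparse A by a dense B.
    Each stored non-zero a[i, c] plays the role of one read port feeding one
    multiply-add unit: it scales row c of B and accumulates into output row i."""
    out = np.zeros((A.shape[0], B.shape[1]), dtype=np.result_type(A, B))
    for i in range(A.shape[0]):
        vals, cols = compress_relaxed(A[i], n, m)
        for v, c in zip(vals, cols):      # the N ports would run in parallel
            out[i] += v * B[c]            # product-first: a whole row of B at once
    return out
```

For a weight matrix already pruned to at most $N$ non-zeros per block of 128, the compression step is lossless and demm_style_matmul reproduces the dense product A @ B exactly; only the $N$ stored values per block are ever read or multiplied, which is the source of the latency savings the abstract reports.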
Source journal
IEEE Computer Architecture Letters
CiteScore: 4.60
Self-citation rate: 4.30%
Articles published per year: 29
Journal introduction: IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network), real-time and high-availability architectures, and reconfigurable systems.