Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors

IF 3.8 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers | Pub Date: 2025-01-24 | DOI: 10.1109/TC.2025.3533083

Vasileios Titopoulos; Kosmas Alexandridis; Christodoulos Peltekis; Chrysostomos Nicopoulos; Giorgos Dimitrakopoulos

IEEE Transactions on Computers, vol. 74, no. 4, pp. 1446-1460. Journal Article. https://ieeexplore.ieee.org/document/10852517/

Citations: 0

Abstract

Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training, or inference, heavily relies on matrix multiplications that can be efficiently executed on vector processors, or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the impact of data distribution across the scalar and vector register files, data locality, and the effectiveness of loop unrolling are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would reap even higher performance. The newly proposed instruction is called vindexmac, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and it reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of a custom instruction improves runtime by 25% and 33%, when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.
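As a rough illustration of the kind of kernel the abstract discusses, the sketch below multiplies a structured-sparse matrix A by a dense matrix B row by row. It is not the authors' implementation: the 2:4 pattern (at most two non-zeros in every group of four columns), the padded storage layout, and the function names `compress_n_m` and `sparse_matmul` are all assumptions made for this example. The stored column index selects which row of B is multiplied and accumulated, which is loosely analogous to the indirect read that the abstract attributes to vindexmac.

```python
def compress_n_m(A, n=2, m=4):
    """Compress each row of A into parallel (value, column index) lists.

    Assumes an n:m structured-sparse pattern: at most n non-zeros in every
    group of m consecutive columns (hypothetical layout for illustration).
    """
    values, indices = [], []
    for row in A:
        if len(row) % m != 0:
            raise ValueError("row length must be a multiple of m")
        row_vals, row_idx = [], []
        for start in range(0, len(row), m):
            group = row[start:start + m]
            nz = [(v, start + j) for j, v in enumerate(group) if v != 0]
            if len(nz) > n:
                raise ValueError("row violates the assumed n:m pattern")
            # Pad with explicit zeros so every group stores exactly n entries.
            nz += [(0, start)] * (n - len(nz))
            row_vals += [v for v, _ in nz]
            row_idx += [k for _, k in nz]
        values.append(row_vals)
        indices.append(row_idx)
    return values, indices


def sparse_matmul(values, indices, B):
    """Compute C = A @ B row-wise from the compressed form of A."""
    cols = len(B[0])
    C = []
    for row_vals, row_idx in zip(values, indices):
        acc = [0] * cols  # stands in for a vector accumulator
        for v, k in zip(row_vals, row_idx):
            b_row = B[k]  # indirect read: the stored index picks a row of B
            acc = [c + v * b for c, b in zip(acc, b_row)]  # multiply-accumulate
        C.append(acc)
    return C


if __name__ == "__main__":
    A = [[1, 0, 0, 2],
         [0, 3, 4, 0]]
    B = [[1, 2], [3, 4], [5, 6], [7, 8]]
    vals, idx = compress_n_m(A)
    print(sparse_matmul(vals, idx, B))
```

In a vectorized version, each inner list operation would correspond to one vector multiply-accumulate over a row of B, so the work per output row scales with the number of stored non-zeros rather than with the full width of A.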


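Source journal

IEEE Transactions on Computers (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 6.60

Self-citation rate: 5.40%

Articles published: 199

Review time: 6.0 months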

About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.