Strassen Multisystolic Array Hardware Architectures

IF 2.8; Zone 2, Engineering & Technology; JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Pub Date: 2025-02-10, DOI: 10.1109/TVLSI.2025.3530785

Trevor E. Pogue;Nicola Nicolici

Volume 33, Issue 5, pp. 1323-1333. Journal Article. Open access: no. Source: https://ieeexplore.ieee.org/document/10879134/

Citations: 0

Abstract


Strassen Multisystolic Array Hardware Architectures

While Strassen’s matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm’s promised theoretical speedups. This leaves the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen’s algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^{r}$ for r implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to $32\times 32$ and $24\times 24$ at one to two levels of Strassen recursion, respectively. We evaluate the proposed designs in both isolation and an end-to-end machine learning accelerator compared with baseline designs and prior works, achieving state-of-the-art performance.
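As a rough software illustration of where the $1.14^{r}$ factor could come from: Strassen's algorithm replaces the 8 block multiplications of a 2x2 block product with 7, so r recursion levels need $(7/8)^{r}$ as many scalar multiplications, and $8/7 \approx 1.14$. Reading the DSP reduction this way is an inference from the abstract, not something it states. The Python sketch below is not the paper's hardware architecture; the function names (`strassen`, `multiplier_ratio`) are hypothetical, and it assumes square matrices whose size is divisible by 2 at each recursion level.

```python
from fractions import Fraction


def naive_matmul(a, b):
    """Textbook matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def _add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _sub(x, y):
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def _split(m):
    """Split a square matrix into four equally sized quadrants."""
    h = len(m) // 2
    return ([row[:h] for row in m[:h]], [row[h:] for row in m[:h]],
            [row[:h] for row in m[h:]], [row[h:] for row in m[h:]])


def strassen(a, b, levels):
    """Apply `levels` Strassen recursion levels, then fall back to naive."""
    if levels == 0 or len(a) % 2 != 0:
        return naive_matmul(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    nxt = levels - 1
    # Seven block products instead of the eight used by the naive block method.
    m1 = strassen(_add(a11, a22), _add(b11, b22), nxt)
    m2 = strassen(_add(a21, a22), b11, nxt)
    m3 = strassen(a11, _sub(b12, b22), nxt)
    m4 = strassen(a22, _sub(b21, b11), nxt)
    m5 = strassen(_add(a11, a12), b22, nxt)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), nxt)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), nxt)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


def multiplier_ratio(levels):
    """Naive-to-Strassen multiplication count ratio, (8/7)**levels."""
    return Fraction(8, 7) ** levels
```

Calling `strassen(a, b, 1)` on two 4x4 integer matrices should return the same result as `naive_matmul(a, b)`, and `float(multiplier_ratio(1))` is about 1.143, which rounds to the 1.14 quoted in the abstract.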


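Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months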

Journal description: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.