Strassen Multisystolic Array Hardware Architectures
Authors: Trevor E. Pogue; Nicola Nicolici
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 5, pp. 1323-1333
Published: 2025-02-10
DOI: 10.1109/TVLSI.2025.3530785
URL: https://ieeexplore.ieee.org/document/10879134/

Abstract:
While Strassen’s matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm’s promised theoretical speedups. This leaves the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this, and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen’s algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^{r}$ for $r$ implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to $32\times 32$ and $24\times 24$ at one and two levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and within an end-to-end machine learning accelerator, comparing them with baseline designs and prior works and achieving state-of-the-art performance.
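The abstract does not say where the $1.14^{r}$ factor comes from. A plausible reading, which is our assumption here, is that it reflects Strassen's use of 7 block multiplications in place of the naive 8 at each recursion level, since $8/7 \approx 1.143$. The pure-Python sketch below is a software illustration only and is not the authors' hardware architecture. It applies the standard Strassen recursion for a chosen number of levels and counts scalar multiplications. All function names (`strassen`, `naive_mul`, `count_mults`) are hypothetical, and the matrices are assumed to be square with a size divisible by $2^{r}$.

```python
def mat_add(A, B, sign=1):
    """Element-wise A + sign * B for equally sized matrices (lists of lists)."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive_mul(A, B, counter):
    """Textbook matrix product; counter[0] accumulates scalar multiplications."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    counter[0] += n * m * p
    return C


def split(M):
    """Split a square matrix into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, levels, counter):
    """Apply `levels` Strassen recursion levels, then fall back to naive_mul."""
    if levels == 0:
        return naive_mul(A, B, counter)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    def rec(X, Y):
        return strassen(X, Y, levels - 1, counter)

    # The seven Strassen block products (naive blocking would need eight).
    M1 = rec(mat_add(A11, A22), mat_add(B11, B22))
    M2 = rec(mat_add(A21, A22), B11)
    M3 = rec(A11, mat_add(B12, B22, -1))
    M4 = rec(A22, mat_add(B21, B11, -1))
    M5 = rec(mat_add(A11, A12), B22)
    M6 = rec(mat_add(A21, A11, -1), mat_add(B11, B12))
    M7 = rec(mat_add(A12, A22, -1), mat_add(B21, B22))

    # Recombine the products into the four output quadrants.
    C11 = mat_add(mat_add(M1, M4), mat_add(M7, M5, -1))
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(M1, M2, -1), mat_add(M3, M6))
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


def count_mults(n, levels):
    """Multiply two n x n test matrices and return the scalar multiplication count."""
    A = [[i * n + j for j in range(n)] for i in range(n)]
    B = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
    counter = [0]
    C = strassen(A, B, levels, counter)
    assert C == naive_mul(A, B, [0])  # sanity check against the naive product
    return counter[0]
```

For example, `count_mults(8, 0) / count_mults(8, r)` evaluates to $(8/7)^{r}$ for `r = 1` and `r = 2`, which matches the abstract's factor under the stated assumption. Whether that arithmetic saving carries over one-to-one to DSP blocks depends on the hardware mapping described in the paper itself.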
About the journal:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.