{"title":"Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency","authors":"Matteo Perotti;Samuel Riedel;Matheus Cavalcante;Luca Benini","doi":"10.1109/TCAD.2025.3528349","DOIUrl":null,"url":null,"abstract":"The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64 bit floating-point-capable vector processor based on RISC-V’s vector extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dualcore vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries’ 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 GFLOPSDP/W at 1 GHz and nominal operating conditions (TT, 0.80 V, and 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 GFLOPSDP, and 99.3 GFLOPSDP/W (61% of the power spent in the FPU) on a 2D workload with <inline-formula> <tex-math>$7\\times 7$ </tex-math></inline-formula> kernel, resulting in an outstanding area/energy efficiency of 171 GFLOPSDP/W/mm2. At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 7","pages":"2488-2502"},"PeriodicalIF":2.7000,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10836838/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64 bit floating-point-capable vector processor based on RISC-V’s vector extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dualcore vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries’ 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 GFLOPSDP/W at 1 GHz and nominal operating conditions (TT, 0.80 V, and 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 GFLOPSDP, and 99.3 GFLOPSDP/W (61% of the power spent in the FPU) on a 2D workload with $7\times 7$ kernel, resulting in an outstanding area/energy efficiency of 171 GFLOPSDP/W/mm2. At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
期刊介绍:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.