RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing

IF 3.1 | Zone 2, Engineering & Technology | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems | Pub Date: 2025-06-23 | DOI: 10.1109/TVLSI.2025.3574427

Jingzhou Li, Fangfei Yu, Mingyuan Ma, Wei Liu, Yuhan Wang, Hualin Wu, Hu He

Volume 33, Issue 8, pp. 2239-2251 | Source: https://ieeexplore.ieee.org/document/11048708/

Citations: 0

Abstract


RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing

General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with the vector extension [RISC-V vector (RVV)]. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field-programmable gate array (FPGA) platform consisting of four Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average of 83.9% instruction reduction and 87.4% cycles-per-instruction (CPI) improvement over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instructions-per-cycle (IPC) performance with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains an extra average reduction of 69.1% in instruction count and a 68.4% cycle reduction ratio when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus is available at https://github.com/THU-DSP-LAB/ventus-gpgpu
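The metrics quoted in the abstract are connected by simple arithmetic, which the following sketch illustrates. It assumes the conventional definitions (CPI = cycles / instructions, IPC = 1 / CPI, reduction = 1 - new / baseline); the abstract does not say how its averages were computed, so the paper may define them differently. The function names are hypothetical, and the baseline counts in the usage section are made up for illustration, not results from the paper.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction for one run."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def ipc_to_cpi(ipc: float) -> float:
    """Convert instructions per cycle to cycles per instruction."""
    if ipc <= 0:
        raise ValueError("ipc must be positive")
    return 1.0 / ipc


def reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline` (0.839 == 83.9%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - new / baseline


if __name__ == "__main__":
    # IPC values as quoted in the abstract for 4-, 8-, and 16-thread configurations.
    for threads, ipc in [(4, 0.47), (8, 0.40), (16, 0.32)]:
        print(f"{threads:>2} threads: IPC {ipc:.2f} -> CPI {ipc_to_cpi(ipc):.2f}")

    # Hypothetical counts: a baseline executing 1000 instructions versus 161
    # on the improved design corresponds to an 83.9% instruction reduction.
    print(f"instruction reduction: {reduction(1000, 161):.1%}")
```

Read this way, an 83.9% instruction reduction means the same workload executes roughly one sixth of the baseline's instructions, under the assumed definition above.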


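Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months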

Journal description: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. The IEEE Transactions on VLSI Systems was founded to address this critical area through a common forum. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and system-level qualification. The coverage of these Transactions thus focuses on VLSI/ULSI microelectronic systems integration.