RISC-V-Based GPGPU With Vector Capabilities for High-Performance Computing
Jingzhou Li, Fangfei Yu, Mingyuan Ma, Wei Liu, Yuhan Wang, Hualin Wu, Hu He
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 8, pp. 2239-2251. Published 2025-06-23. DOI: 10.1109/TVLSI.2025.3574427. https://ieeexplore.ieee.org/document/11048708/
General-purpose graphics processing units (GPGPUs) have become a leading platform for accelerating modern compute-intensive applications, such as large language models and generative artificial intelligence (AI). However, the lack of advanced open-source GPGPU microarchitectures has hindered high-performance research in this area. In this article, we present Ventus, a high-performance open-source GPGPU implementation built upon the RISC-V architecture with the vector extension [RISC-V vector (RVV)]. Ventus introduces customized instructions and a comprehensive software toolchain to optimize performance. We deployed the design on a field-programmable gate array (FPGA) platform consisting of four Xilinx VU19P devices, scaling up to 16 streaming multiprocessors (SMs) and supporting 256 warps. Experimental results demonstrate that Ventus exhibits key performance features comparable to commercial GPGPUs, achieving an average 83.9% reduction in instruction count and an 87.4% improvement in cycles per instruction (CPI) over the leading open-source alternatives. Under 4-, 8-, and 16-thread configurations, Ventus maintains robust instructions-per-cycle (IPC) performance, with values of 0.47, 0.40, and 0.32, respectively. In addition, the tensor core of Ventus attains a further average reduction of 69.1% in instruction count and a 68.4% reduction in cycles when running AI-related workloads. These findings highlight Ventus as a promising solution for future high-performance GPGPU research and development, offering a robust open-source alternative to proprietary solutions. Ventus is available at https://github.com/THU-DSP-LAB/ventus-gpgpu
About the journal:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems-integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. The coverage of these Transactions thus focuses on VLSI/ULSI microelectronic systems integration.