{"title":"A Fusion of ReLU and upSample Function With Store in VTA for Higher Throughput of Network Inferencing","authors":"Dong Gou, Zheng He, Yinghai Zhao, Xuhui Wang, Yanshuo Gao, Guohe Zhang, Kuizhi Mei","doi":"10.1049/ell2.70284","DOIUrl":null,"url":null,"abstract":"The TVM–versatile tensor accelerator (VTA) stack combines hardware–software co-design with operator-level optimizations but relies on ARM processors for auxiliary functions like ReLU and upSample, causing data-transfer bottlenecks and inefficiencies. To address this, we propose fusion VTA (FVTA), integrating ReLU and upSample into the RTL-based Store module with a newly designed instruction set and lightweight C++ runtime. This ensures seamless compatibility with existing VTA modules and eliminates ARM dependence. Evaluated on YOLOv3 with a Xilinx ZCU104 board, FVTA achieves a 195 ms frame processing time for 256 × 256 RGB images, 4% faster than EVTA. This work highlights how combining the flexible TVM–VTA stack with optimized circuit-level design can significantly enhance inference efficiency.","PeriodicalId":11556,"journal":{"name":"Electronics Letters","volume":"61 1","pages":""},"PeriodicalIF":0.7000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ell2.70284","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Electronics Letters","FirstCategoryId":"5","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/ell2.70284","RegionNum":4,"RegionCategory":"Engineering and Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
A Fusion of ReLU and upSample Function With Store in VTA for Higher Throughput of Network Inferencing
The TVM–versatile tensor accelerator (VTA) stack combines hardware–software co-design with operator-level optimizations but relies on ARM processors for auxiliary functions like ReLU and upSample, causing data-transfer bottlenecks and inefficiencies. To address this, we propose fusion VTA (FVTA), integrating ReLU and upSample into the RTL-based Store module with a newly designed instruction set and lightweight C++ runtime. This ensures seamless compatibility with existing VTA modules and eliminates ARM dependence. Evaluated on YOLOv3 with a Xilinx ZCU104 board, FVTA achieves a 195 ms frame processing time for 256 × 256 RGB images—4% faster than EVTA. This work highlights how combining the flexible TVM–VTA stack with optimized circuit-level design can significantly enhance inference efficiency.
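To make the fusion idea in the abstract concrete, the following is a minimal software sketch, not the authors' RTL design or instruction set. It contrasts an unfused path (ReLU, then upSample, with an intermediate buffer) against a single store-time pass that clamps and replicates each value as it is written out. The function names, the integer 2-D tile layout, nearest-neighbour replication, and the default scale factor of 2 are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List

Tensor2D = List[List[int]]


def relu_then_upsample(acc: Tensor2D, scale: int = 2) -> Tensor2D:
    """Unfused baseline (hypothetical): two passes and an intermediate buffer."""
    activated = [[max(v, 0) for v in row] for row in acc]  # pass 1: ReLU
    out: Tensor2D = []
    for row in activated:  # pass 2: nearest-neighbour upsample (assumed mode)
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def fused_store(acc: Tensor2D, scale: int = 2) -> Tensor2D:
    """Fused sketch (hypothetical): ReLU and replication happen at write time."""
    h, w = len(acc), len(acc[0])
    out = [[0] * (w * scale) for _ in range(h * scale)]
    for y in range(h):
        for x in range(w):
            v = acc[y][x] if acc[y][x] > 0 else 0  # ReLU applied as the value is stored
            for dy in range(scale):
                for dx in range(scale):
                    out[y * scale + dy][x * scale + dx] = v  # replicate into the output
    return out


if __name__ == "__main__":
    tile = [[-3, 5], [7, -1]]
    # Both paths must agree; the fused one never materialises the ReLU-only tensor.
    assert fused_store(tile) == relu_then_upsample(tile)
    for row in fused_store(tile):
        print(row)
```

The point of the sketch is only that the fused path reads each source value once and needs no intermediate tensor, which is the kind of data movement the abstract attributes to the ARM-side ReLU and upSample steps.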
About the journal:
Electronics Letters is an internationally renowned peer-reviewed rapid-communication journal that publishes short original research papers every two weeks. Its broad and interdisciplinary scope covers the latest developments in all fields related to electronic engineering, including communication, biomedical, optical and device technologies. Electronics Letters also provides further insight into some of the latest developments through special features and interviews.
Scope
As a journal at the forefront of its field, Electronics Letters publishes papers covering all themes of electronic and electrical engineering. The major themes of the journal are listed below.
Antennas and Propagation
Biomedical and Bioinspired Technologies, Signal Processing and Applications
Control Engineering
Electromagnetism: Theory, Materials and Devices
Electronic Circuits and Systems
Image, Video and Vision Processing and Applications
Information, Computing and Communications
Instrumentation and Measurement
Microwave Technology
Optical Communications
Photonics and Opto-Electronics
Power Electronics, Energy and Sustainability
Radar, Sonar and Navigation
Semiconductor Technology
Signal Processing
MIMO