{"title":"阿米巴:基于 FPGA 的高效灵活的任意核 CNN 加速器","authors":"Xiao Wu;Miaoxin Wang;Jun Lin;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3383871","DOIUrl":null,"url":null,"abstract":"Inspired by the key operation of vision transformers (ViTs), convolutional neural networks (CNNs) have widely adopted arbitrary-kernel convolutions to achieve high performance in diverse vision-based tasks. However, existing hardware efforts primarily focus on implementing CNN models that consist of a stack of small kernels, which poses challenges in supporting large-kernel convolutions. To address this limitation, we propose Amoeba, a flexible field-programmable gate array (FPGA)-based inference accelerator designed for efficiently supporting CNNs with arbitrary kernel sizes. Specifically, we present an optimized dataflow approach in collaboration with the Z-flow method and kernel-segmentation (Kseg) scheme, which enables flexible support for arbitrary-kernel convolutions without sacrificing efficiency. Additionally, we incorporate vertical-fused (VF) and horizontal-fused (HF) methods into the layer execution schedule to optimize the computation and data transfer process. To further enhance the CNN deployment performance, we employ the loop tiling scheme search (LTSS) method, guided by a fine-grained performance model, during the early design phase. The proposed Amoeba accelerator is evaluated on Intel Arria 10 SoC FPGA. The experimental results demonstrate excellent performance on prevalent and emerging CNNs, achieving a throughput of up to 286.2 GOPs. 
Notably, Amoeba achieves \n<inline-formula> <tex-math>$4.36\\times $ </tex-math></inline-formula>\n better DSP efficiency compared to prior works on the same network, highlighting its superior utilization of hardware resources for CNN inference tasks.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":null,"pages":null},"PeriodicalIF":2.8000,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Amoeba: An Efficient and Flexible FPGA-Based Accelerator for Arbitrary-Kernel CNNs\",\"authors\":\"Xiao Wu;Miaoxin Wang;Jun Lin;Zhongfeng Wang\",\"doi\":\"10.1109/TVLSI.2024.3383871\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Inspired by the key operation of vision transformers (ViTs), convolutional neural networks (CNNs) have widely adopted arbitrary-kernel convolutions to achieve high performance in diverse vision-based tasks. However, existing hardware efforts primarily focus on implementing CNN models that consist of a stack of small kernels, which poses challenges in supporting large-kernel convolutions. To address this limitation, we propose Amoeba, a flexible field-programmable gate array (FPGA)-based inference accelerator designed for efficiently supporting CNNs with arbitrary kernel sizes. Specifically, we present an optimized dataflow approach in collaboration with the Z-flow method and kernel-segmentation (Kseg) scheme, which enables flexible support for arbitrary-kernel convolutions without sacrificing efficiency. Additionally, we incorporate vertical-fused (VF) and horizontal-fused (HF) methods into the layer execution schedule to optimize the computation and data transfer process. To further enhance the CNN deployment performance, we employ the loop tiling scheme search (LTSS) method, guided by a fine-grained performance model, during the early design phase. 
The proposed Amoeba accelerator is evaluated on Intel Arria 10 SoC FPGA. The experimental results demonstrate excellent performance on prevalent and emerging CNNs, achieving a throughput of up to 286.2 GOPs. Notably, Amoeba achieves \\n<inline-formula> <tex-math>$4.36\\\\times $ </tex-math></inline-formula>\\n better DSP efficiency compared to prior works on the same network, highlighting its superior utilization of hardware resources for CNN inference tasks.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10508998/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10508998/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Amoeba: An Efficient and Flexible FPGA-Based Accelerator for Arbitrary-Kernel CNNs
Inspired by the key operation of vision transformers (ViTs), convolutional neural networks (CNNs) have widely adopted arbitrary-kernel convolutions to achieve high performance in diverse vision-based tasks. However, existing hardware efforts primarily focus on implementing CNN models that consist of a stack of small kernels, which poses challenges in supporting large-kernel convolutions. To address this limitation, we propose Amoeba, a flexible field-programmable gate array (FPGA)-based inference accelerator designed for efficiently supporting CNNs with arbitrary kernel sizes. Specifically, we present an optimized dataflow approach in collaboration with the Z-flow method and kernel-segmentation (Kseg) scheme, which enables flexible support for arbitrary-kernel convolutions without sacrificing efficiency. Additionally, we incorporate vertical-fused (VF) and horizontal-fused (HF) methods into the layer execution schedule to optimize the computation and data transfer process. To further enhance the CNN deployment performance, we employ the loop tiling scheme search (LTSS) method, guided by a fine-grained performance model, during the early design phase. The proposed Amoeba accelerator is evaluated on an Intel Arria 10 SoC FPGA. The experimental results demonstrate excellent performance on prevalent and emerging CNNs, achieving a throughput of up to 286.2 GOPs. Notably, Amoeba achieves 4.36× better DSP efficiency compared to prior works on the same network, highlighting its superior utilization of hardware resources for CNN inference tasks.
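The kernel-segmentation (Kseg) idea named in the abstract can be illustrated numerically: a large convolution kernel is decomposed into small tiles of a hardware-supported size, each tile is applied as a small convolution on a correspondingly shifted input window, and the partial sums are accumulated. The sketch below is an illustrative software model under that assumption only; the paper's actual hardware dataflow (and its interaction with the Z-flow method) is not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """Direct 2-D cross-correlation with 'valid' padding, used as a reference."""
    Kh, Kw = k.shape
    Oh, Ow = x.shape[0] - Kh + 1, x.shape[1] - Kw + 1
    out = np.zeros((Oh, Ow))
    for i in range(Oh):
        for j in range(Ow):
            out[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * k)
    return out

def conv2d_kseg(x, k, seg=3):
    """Kseg-style evaluation: split an arbitrary Kh x Kw kernel into tiles of
    at most seg x seg, convolve each tile against the input window shifted by
    the tile's offset inside the kernel, and accumulate the partial outputs.
    (Illustrative model of the segmentation idea, not the paper's design.)"""
    Kh, Kw = k.shape
    Oh, Ow = x.shape[0] - Kh + 1, x.shape[1] - Kw + 1
    out = np.zeros((Oh, Ow))
    for r0 in range(0, Kh, seg):
        for c0 in range(0, Kw, seg):
            sub = k[r0:r0 + seg, c0:c0 + seg]
            sh, sw = sub.shape
            # The segment at kernel offset (r0, c0) reads the input shifted
            # by the same offset; 'valid' convolution then yields Oh x Ow.
            win = x[r0:r0 + Oh + sh - 1, c0:c0 + Ow + sw - 1]
            out += conv2d_valid(win, sub)
    return out
```

Because every kernel tile produces a full-resolution partial-sum map, a fixed small-kernel compute array can serve any kernel size by iterating over tiles, which is the flexibility the abstract claims for arbitrary-kernel support.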
Journal Introduction:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems-integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and system-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.