XVDPU: A High Performance CNN Accelerator on Versal Platform Powered by AI Engine

IF 3.1 · CAS Tier 4 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, Dongdong Xu, Zhuohuan Liu, Mengke Liu, Xiaoyang Yan, Hong Wang, Rongzhang Zheng, Li Wang, Dong Li, Satyaprakash Pareek, Jian Weng, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan
DOI: 10.1145/3617836
Journal: ACM Transactions on Reconfigurable Technology and Systems
Publication date: 2023-09-13 (Journal Article)
Citations: 0

Abstract

Nowadays, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends toward higher accuracy and higher resolution produce larger networks, making computation and I/O requirements the key bottlenecks. In this paper, we propose XVDPU, an AI-Engine (AIE)-based CNN accelerator on Versal chips that meets heavy computation requirements. To resolve the I/O bottleneck, we adopt several techniques to improve data reuse and reduce I/O requirements. We further propose an Arithmetic Logic Unit (ALU) that better balances resource utilization, new-feature support, and whole-system efficiency. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation achieves 1653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8 × faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation further achieves 4050 FPS. We propose a tiling strategy that achieves feature-map-stationary (FMS) execution for high-definition CNNs (HD-CNNs) with the accelerator, yielding a 3.8 × FPS improvement on the Residual Channel Attention Network (RCAN) and 3.1 × on Super-Efficient Super-Resolution (SESR). The accelerator can also handle the 3D convolution task in disparity estimation, achieving end-to-end (E2E) performance of 10.1 FPS with all the optimizations.
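The headline speedups in the abstract follow directly from the reported throughput numbers. A minimal sanity check of that arithmetic, using only figures stated above (the helper name `speedup` is ours, for illustration):

```python
# Sanity-check the speedup figures reported in the abstract.
# All FPS numbers below are taken from the abstract itself.

def speedup(fast_fps: float, slow_fps: float) -> float:
    """Throughput ratio between two designs."""
    return fast_fps / slow_fps

# 96-AIE-core XVDPU on VCK190 vs. the ZCU102 baseline, both on ResNet50:
ratio_96 = speedup(1653.0, 168.5)
print(round(ratio_96, 1))  # 9.8, matching the claimed 9.8x

# Scaling from the 96-core to the 256-core implementation:
ratio_256 = speedup(4050.0, 1653.0)
print(round(ratio_256, 2))  # about 2.45x throughput from 2.67x more cores
```

The 256-core result scaling sub-linearly with core count (2.45× throughput for 2.67× cores) is consistent with the paper's framing of I/O, not compute, as the limiting resource.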
Source journal

ACM Transactions on Reconfigurable Technology and Systems (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
CiteScore: 4.90
Self-citation rate: 8.70%
Articles per year: 79
Review time: >12 weeks
Journal description: TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage of other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems; TRETS is a journal that covers reconfigurability in its own right. Appropriate topics include all levels of reconfigurable system abstraction and all aspects of reconfigurable technology, including platforms, programming environments, and application successes that support these systems for computing or other applications:

- The board and systems architectures of a reconfigurable platform.
- Programming environments of reconfigurable systems, especially those designed for use with reconfigurable systems that will lead to increased programmer productivity.
- Languages and compilers for reconfigurable systems.
- Logic synthesis and related tools, as they relate to reconfigurable systems.
- Applications on which success can be demonstrated.
- The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.)

In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers and environments, logic synthesis, and high-performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for reuse across multiple applications) would be appropriate.