XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine

Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, D. Xu, Hong Wang, Rongzhang Zheng, Satyaprakash Pareek, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan
{"title":"XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine","authors":"Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, D. Xu, Hong Wang, Rongzhang Zheng, Satyaprakash Pareek, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan","doi":"10.1109/FPL57034.2022.00041","DOIUrl":null,"url":null,"abstract":"The convolution neural networks (CNNs) are widely used in computer vision applications nowadays. However, the trends of higher accuracy and higher resolution generate larger networks, indicating that computation and I/O bandwidth are key bottlenecks to reach performance. The Xilinx's latest 7nm Versal ACAP platform with AI-Engine (AIE) cores can deliver up-to 8x silicon compute density at 50% the power consumption compared with the traditional FPGA solutions. In this paper, we propose XVDPU: the AIE-based int8-precision CNN accelerator on Versal chips, scaling from 16-AIE-core (C16B1) to 320-AIE-core (C64B5, Peak:109.2 TOPs) to meet computation requirements. To resolve IO bottleneck, we adopt several techniques such as multi-batch (MB), shared-weights (SHRWGT), feature-map-stationary (FMS) and long-load-weights (LLW) to improve data-reuse and reduce I/O requirements. An Arithmetic Logic Unit (ALU) design is further proposed into the accelerator which mainly performs non-convolution layers such as Depthwise-Conv layer, Pooling layer and Non-linear function layers using the same logic resources, which can better balance resource utilization, new feature support and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core (C32B3, Peak: 32.76 TOPs) implementation can achieve 1653 FPS for ResNet50 on VCK190, which is 9.8x faster than the design on ZCU102 running at 168.5 FPS with peak 3.6 TOPs. The 256-AIE-core (C32B8, Peak: 87.36 TOPs) implementation can further achieve 4050 FPS which better leverages the computing power of Versal AIE devices. The powerful XVDPU will help enable many applications on the embedded system, such as low-latency data center, high level ADAS and complex robotics.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPL57034.2022.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Convolutional neural networks (CNNs) are widely used in computer vision applications today. However, the push for higher accuracy and higher resolution produces larger networks, making computation and I/O bandwidth the key bottlenecks to performance. Xilinx's latest 7nm Versal ACAP platform with AI Engine (AIE) cores can deliver up to 8x the silicon compute density at 50% of the power consumption of traditional FPGA solutions. In this paper, we propose XVDPU, an AIE-based int8-precision CNN accelerator for Versal chips that scales from 16 AIE cores (C16B1) to 320 AIE cores (C64B5, peak: 109.2 TOPs) to meet computation requirements. To resolve the I/O bottleneck, we adopt several techniques, such as multi-batch (MB), shared-weights (SHRWGT), feature-map-stationary (FMS), and long-load-weights (LLW), to improve data reuse and reduce I/O requirements. We further add an Arithmetic Logic Unit (ALU) to the accelerator that executes non-convolution layers, such as depthwise convolution, pooling, and non-linear function layers, on the same logic resources, which better balances resource utilization, new-feature support, and the efficiency of the whole system. We have successfully deployed more than 100 CNN models on our accelerator. Our experimental results show that the 96-AIE-core implementation (C32B3, peak: 32.76 TOPs) achieves 1653 FPS for ResNet50 on VCK190, 9.8x faster than the design on ZCU102, which runs at 168.5 FPS with a peak of 3.6 TOPs. The 256-AIE-core implementation (C32B8, peak: 87.36 TOPs) further achieves 4050 FPS, better leveraging the computing power of Versal AIE devices. The powerful XVDPU will enable many applications on embedded systems, such as low-latency data centers, high-level ADAS, and complex robotics.
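As a quick consistency check on the figures quoted above (a reader's sanity-check sketch, not code from the paper): the three configurations with stated peaks all work out to the same per-core peak, and the 9.8x speedup follows directly from the two quoted frame rates.

```python
# Sanity check of the numbers reported in the abstract.
# Verification sketch only; not part of the XVDPU implementation.

# (AIE cores, reported peak int8 TOPs) for configurations with stated peaks.
configs = {
    "C32B3": (96, 32.76),
    "C32B8": (256, 87.36),
    "C64B5": (320, 109.2),
}

for name, (cores, peak_tops) in configs.items():
    print(f"{name}: {peak_tops / cores:.5f} TOPs per AIE core")
# Every configuration works out to 0.34125 TOPs/core, so the reported
# peak compute scales linearly with the AIE-core count.

# ResNet50 throughput: C32B3 on VCK190 vs. the ZCU102 baseline.
vck190_fps, zcu102_fps = 1653, 168.5
print(f"speedup: {vck190_fps / zcu102_fps:.1f}x")  # ~9.8x, as reported
```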