A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks

Yue Yu, Jiapeng Luo, W. Mao, Zhongfeng Wang
{"title":"A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks","authors":"Yue Yu, Jiapeng Luo, W. Mao, Zhongfeng Wang","doi":"10.1109/SiPS52927.2021.00033","DOIUrl":null,"url":null,"abstract":"In recent years, deformable convolutional networks are widely adopted in object detection tasks and have achieved outstanding performance. Compared with conventional convolution, the deformable convolution has an irregular receptive field to adapt to objects with different sizes and shapes. However, the irregularity of the receptive field causes inefficient access to memory and increases the complexity of control logic. Toward hardware-friendly implementation, prior works change the characteristics of deformable convolution by restricting the receptive field, leading to accuracy degradation. In this paper, we develop a dedicated Sampling Core to sample and rearrange the input pixels, enabling the convolution array to access the inputs regularly. In addition, a memory-efficient dataflow is introduced to match the processing speed of the Sampling Core and convolutional array, which improves hardware utilization and reduces access to off-chip memory. Based on these optimizations, we propose a novel hardware architecture for the deformable convolution network, which is the first work to accelerate the original deformable convolution network. With the design of the memory-efficient architecture, the access to the off-chip memory is reduced significantly. We implement it on Xilinx Virtex-7 FPGA, and experiments show that the energy efficiency reaches 50.29 GOPS/W, which is 2.5 times higher compared with executing the same network on GPU.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SiPS52927.2021.00033","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

In recent years, deformable convolutional networks have been widely adopted in object detection tasks and have achieved outstanding performance. Compared with conventional convolution, deformable convolution has an irregular receptive field that adapts to objects of different sizes and shapes. However, the irregularity of the receptive field causes inefficient memory access and increases the complexity of the control logic. To obtain hardware-friendly implementations, prior works change the characteristics of deformable convolution by restricting the receptive field, which leads to accuracy degradation. In this paper, we develop a dedicated Sampling Core that samples and rearranges the input pixels, enabling the convolution array to access the inputs in a regular pattern. In addition, a memory-efficient dataflow is introduced to match the processing speeds of the Sampling Core and the convolutional array, which improves hardware utilization and reduces access to off-chip memory. Based on these optimizations, we propose a novel hardware architecture for deformable convolutional networks, which is the first work to accelerate the original deformable convolution. With this memory-efficient architecture, access to off-chip memory is reduced significantly. We implement the design on a Xilinx Virtex-7 FPGA, and experiments show that the energy efficiency reaches 50.29 GOPS/W, which is 2.5 times higher than executing the same network on a GPU.
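To make the irregular-access problem concrete, the following sketch shows a plain single-channel 3x3 deformable convolution: each kernel tap is displaced by a learned fractional offset and read through bilinear interpolation, so the input addresses touched at each output position are data dependent. This is a minimal NumPy illustration of the general operation described in the abstract, not the paper's hardware dataflow; the function names and the (H_out, W_out, 9, 2) offset layout are our own assumptions for the example.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample feature map x (H, W) at a fractional location (py, px)
    with bilinear interpolation; out-of-bounds reads contribute 0."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yi, xi in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= yi < H and 0 <= xi < W:
            wy = 1.0 - abs(py - yi)
            wx = 1.0 - abs(px - xi)
            val += wy * wx * x[yi, xi]
    return val

def deformable_conv3x3(x, weight, offsets):
    """Single-channel 3x3 deformable convolution (stride 1, no padding).

    x:       (H, W) input feature map
    weight:  (3, 3) kernel
    offsets: (H_out, W_out, 9, 2) learned (dy, dx) offsets per output
             position and kernel tap, produced by a separate offset branch.
    """
    H, W = x.shape
    H_out, W_out = H - 2, W - 2
    out = np.zeros((H_out, W_out))
    taps = [(ky, kx) for ky in range(3) for kx in range(3)]
    for oy in range(H_out):
        for ox in range(W_out):
            acc = 0.0
            for k, (ky, kx) in enumerate(taps):
                dy, dx = offsets[oy, ox, k]
                # Irregular, data-dependent read: the sampled location depends
                # on the learned offset, so neighbouring outputs no longer
                # reuse the same input rows as in a conventional convolution.
                acc += weight[ky, kx] * bilinear_sample(x, oy + ky + dy, ox + kx + dx)
            out[oy, ox] = acc
    return out

# Example: with all-zero offsets the operation reduces to an ordinary 3x3 convolution.
x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
zero_off = np.zeros((6, 6, 9, 2))
out = deformable_conv3x3(x, w, zero_off)
```

Because the sampled coordinates vary per output position and per tap, a naive implementation issues scattered reads from the input buffer. In terms of this sketch, the Sampling Core described in the abstract can be understood as performing the bilinear gathers ahead of time and handing the convolution array a densely packed buffer of sampled pixels, so that the MAC array itself only ever sees regular, sequential accesses.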