A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks
Yue Yu, Jiapeng Luo, W. Mao, Zhongfeng Wang
2021 IEEE Workshop on Signal Processing Systems (SiPS), October 2021
DOI: 10.1109/SiPS52927.2021.00033
In recent years, deformable convolutional networks have been widely adopted in object detection tasks and have achieved outstanding performance. Compared with conventional convolution, deformable convolution has an irregular receptive field that adapts to objects of different sizes and shapes. However, this irregularity causes inefficient memory access and increases the complexity of the control logic. To ease hardware implementation, prior works change the characteristics of deformable convolution by restricting the receptive field, which degrades accuracy. In this paper, we develop a dedicated Sampling Core that samples and rearranges the input pixels, enabling the convolution array to access its inputs regularly. In addition, a memory-efficient dataflow is introduced to match the processing speeds of the Sampling Core and the convolution array, which improves hardware utilization and reduces off-chip memory access. Based on these optimizations, we propose a novel hardware architecture for deformable convolutional networks, which is the first work to accelerate the original, unrestricted deformable convolution. With this memory-efficient design, accesses to off-chip memory are reduced significantly. We implement the architecture on a Xilinx Virtex-7 FPGA, and experiments show that the energy efficiency reaches 50.29 GOPS/W, 2.5 times that of executing the same network on a GPU.
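To make the "irregular receptive field" concrete: in the standard deformable convolution formulation (Dai et al.), each kernel tap samples the input at a fractional, offset-shifted location via bilinear interpolation, which is exactly what makes memory access irregular in hardware. The following is a minimal NumPy sketch of that sampling for a single-channel 3x3 case; it illustrates the operation the paper's Sampling Core must feed to the convolution array, not the paper's hardware dataflow itself, and all function names are illustrative.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a (H, W) feature map at fractional (y, x).
    Out-of-bounds neighbors contribute zero."""
    H, W = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, xx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= yy < H and 0 <= xx < W:
            # Weight shrinks linearly with distance from the sample point.
            w = (1 - abs(y - yy)) * (1 - abs(x - xx))
            val += w * feature[yy, xx]
    return val

def deformable_conv2d_single(feature, kernel, offsets):
    """Deformable 3x3 convolution over one channel, 'valid' output size.

    feature: (H, W) input map; kernel: (3, 3) weights;
    offsets: (H_out, W_out, 9, 2) learned (dy, dx) per kernel tap.
    """
    H, W = feature.shape
    H_out, W_out = H - 2, W - 2
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            acc = 0.0
            for k in range(9):
                ky, kx = divmod(k, 3)
                dy, dx = offsets[i, j, k]
                # Irregular receptive field: each tap reads a fractional,
                # offset-shifted location instead of a fixed grid point.
                acc += kernel[ky, kx] * bilinear_sample(
                    feature, i + ky + dy, j + kx + dx)
            out[i, j] = acc
    return out
```

With all offsets zero the operation reduces to an ordinary 3x3 convolution; with learned nonzero offsets, consecutive taps can land at scattered, non-contiguous addresses, which is the memory-access pattern the paper's Sampling Core regularizes before the data reaches the convolution array.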