An Operation-Minimized FPGA Accelerator Design by Dynamically Exploiting Sparsity in CNN Winograd Transform

2019 32nd IEEE International System-on-Chip Conference (SOCC). Pub Date: 2019-09-01. DOI: 10.1109/SOCC46988.2019.1570558495

Xinkai Di, Haigang Yang, Zhihong Huang, Ning Mao

Citations: 1

Abstract

Deep convolutional neural networks (CNNs) incur high computational complexity. To reduce the hardware operation overhead, implementations have been attempted with both the fast Winograd transform algorithm and sparsity exploration methods. Previous studies, however, have concentrated mainly on the fixed sparsity patterns of the weight filters. In this paper, we focus specifically on exploiting the varying sparsity patterns in the input/output activations of the Winograd-transformed network. To this end, we propose a dynamic compression approach for multiplication with a matrix whose sparsity changes. The processing flow is built around data indexing and restoring. Because the inputs/outputs are generated dynamically during inference, they depend heavily on the actual data being processed. A weight matrix has a static pattern and needs only offline compression; for the dynamic matrix pattern, by contrast, a real-time compression processor module is devised and employed to update the inputs/outputs online within the FPGA's Block RAMs. In the next layer's computation, only the valid data needs to be restored, by following the necessary index information, and broadcast to the corresponding sparse weight matrices, which in turn generates the next batch of inputs/outputs. For verification, the design implements a typical CNN such as VGG on a Xilinx Virtex 7 FPGA device and achieves an overall performance of 629.4 GOPS. Preliminary experimental results show a 2.2x (up to 5.5x) improvement in equivalent GOPS per DSP block with our adaptive sparsity exploitation approach, compared with conventional counterparts.
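
The indexing-and-restoring flow described in the abstract can be illustrated with a small software sketch. The abstract does not specify the index format or the module interfaces, so everything here is an assumption: a one-bit-per-element bitmap stands in for the index information, and the function names `compress`, `restore`, and `sparse_elementwise_mul` are hypothetical. It models one flattened Winograd-domain tile in plain Python and is not the authors' hardware design.

```python
from typing import List, Tuple


def compress(tile: List[float]) -> Tuple[List[int], List[float]]:
    """Pack one flattened tile into (bitmap index, non-zero values).

    Assumption: a bitmap marks which positions hold valid (non-zero) data.
    """
    index = [1 if v != 0 else 0 for v in tile]
    values = [v for v in tile if v != 0]
    return index, values


def restore(index: List[int], values: List[float]) -> List[float]:
    """Rebuild the dense tile from the bitmap index and the packed values."""
    packed = iter(values)
    return [next(packed) if bit else 0.0 for bit in index]


def sparse_elementwise_mul(
    index: List[int], values: List[float], weights: List[float]
) -> Tuple[List[float], int]:
    """Element-wise product that skips zero activations and zero weights.

    Returns the dense product and the number of multiplications performed.
    """
    out = [0.0] * len(index)
    mults = 0
    packed = iter(values)
    for pos, bit in enumerate(index):
        if not bit:
            continue  # zero activation: nothing to restore, nothing to multiply
        activation = next(packed)
        weight = weights[pos]
        if weight != 0:
            out[pos] = activation * weight
            mults += 1
    return out, mults


if __name__ == "__main__":
    tile = [0.0, 1.5, 0.0, -2.0]     # activations, sparsity known only at run time
    weights = [3.0, 2.0, 1.0, 0.0]   # weights, sparsity known offline
    index, values = compress(tile)
    product, mults = sparse_elementwise_mul(index, values, weights)
    print(index, values, product, mults)
```

In this toy tile, only one of the four positions has both a non-zero activation and a non-zero weight, so a single multiplication is performed instead of four. The compression step has to run per tile at inference time because the activation bitmap changes with the input data, whereas the weight pattern could be prepared once offline.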
