Wei-Pau Kiat, Wai Kong Lee, Hung-Khoon Tan, Hui-Fuang Ng
{"title":"流水线ShiftAddNet:一种基于fpga的低硬件消耗CNN实现","authors":"Wei-Pau Kiat, Wai Kong Lee, Hung-Khoon Tan, Hui-Fuang Ng","doi":"10.1002/cta.4419","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>ShiftAddNet is a recently proposed multiplier-less CNN that replaces conventional multiplication with cheaper shift and add operations, which makes it suitable for hardware implementation. In this paper, we present the first implementation of ShiftAddNet FPGA inference core, which achieves low area consumption and fast computation. ShiftAddNet combined the convolutional layer of the DeepShift-PS (denoted as Shift-Accumulate, <i>sac</i>) and AdderNet (denoted as Add-Accumulate, <i>aac</i>) into a single computational stage. Due to this reason, there are data dependencies between the <i>sac</i> and <i>aac</i>, which prohibits them from being executed in parallel, resulting in \n<span></span><math>\n <mn>2</mn>\n <mo>×</mo></math> more operations compared to other multiplier-less CNNs like DeepShift-PS and AdderNet. To overcome this performance bottleneck, we proposed a novel technique to allow pipeline processing between <i>sac</i> and <i>aac</i>, effectively reducing the latency. The proposed ShiftAddNet-18 was evaluated on a small ResNet-18, achieving 11.37 ms of latency per image, which is \n<span></span><math>\n <mo>∼</mo></math>69.21% faster than the original version that takes 19.24 ms. On a denser network, the proposed pipeline ShiftAddNet-101 requires only 61.92 ms as compared to the original version of 98.85 ms, showing a latency reduction of \n<span></span><math>\n <mo>∼</mo></math>37.1%. Compared to the state-of-the-art multiplier-less CNN core (e.g., AdderNet), our work is 20% slower in latency but provides higher accuracy and consumes \n<span></span><math>\n <mn>2.2</mn>\n <mo>×</mo></math> less DSP.</p>\n </div>","PeriodicalId":13874,"journal":{"name":"International Journal of Circuit Theory and Applications","volume":"53 9","pages":"5538-5547"},"PeriodicalIF":1.6000,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pipeline ShiftAddNet: An FPGA-Based CNN Implementation With Low Hardware Consumption Targeting Constrained Devices\",\"authors\":\"Wei-Pau Kiat, Wai Kong Lee, Hung-Khoon Tan, Hui-Fuang Ng\",\"doi\":\"10.1002/cta.4419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>ShiftAddNet is a recently proposed multiplier-less CNN that replaces conventional multiplication with cheaper shift and add operations, which makes it suitable for hardware implementation. In this paper, we present the first implementation of ShiftAddNet FPGA inference core, which achieves low area consumption and fast computation. ShiftAddNet combined the convolutional layer of the DeepShift-PS (denoted as Shift-Accumulate, <i>sac</i>) and AdderNet (denoted as Add-Accumulate, <i>aac</i>) into a single computational stage. Due to this reason, there are data dependencies between the <i>sac</i> and <i>aac</i>, which prohibits them from being executed in parallel, resulting in \\n<span></span><math>\\n <mn>2</mn>\\n <mo>×</mo></math> more operations compared to other multiplier-less CNNs like DeepShift-PS and AdderNet. To overcome this performance bottleneck, we proposed a novel technique to allow pipeline processing between <i>sac</i> and <i>aac</i>, effectively reducing the latency. 
The proposed ShiftAddNet-18 was evaluated on a small ResNet-18, achieving 11.37 ms of latency per image, which is \\n<span></span><math>\\n <mo>∼</mo></math>69.21% faster than the original version that takes 19.24 ms. On a denser network, the proposed pipeline ShiftAddNet-101 requires only 61.92 ms as compared to the original version of 98.85 ms, showing a latency reduction of \\n<span></span><math>\\n <mo>∼</mo></math>37.1%. Compared to the state-of-the-art multiplier-less CNN core (e.g., AdderNet), our work is 20% slower in latency but provides higher accuracy and consumes \\n<span></span><math>\\n <mn>2.2</mn>\\n <mo>×</mo></math> less DSP.</p>\\n </div>\",\"PeriodicalId\":13874,\"journal\":{\"name\":\"International Journal of Circuit Theory and Applications\",\"volume\":\"53 9\",\"pages\":\"5538-5547\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2025-01-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Circuit Theory and Applications\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cta.4419\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Circuit Theory and Applications","FirstCategoryId":"5","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cta.4419","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Pipeline ShiftAddNet: An FPGA-Based CNN Implementation With Low Hardware Consumption Targeting Constrained Devices
ShiftAddNet is a recently proposed multiplier-less CNN that replaces conventional multiplication with cheaper shift and add operations, making it suitable for hardware implementation. In this paper, we present the first FPGA implementation of a ShiftAddNet inference core, which achieves low area consumption and fast computation. ShiftAddNet combines the convolutional layers of DeepShift-PS (denoted shift-accumulate, sac) and AdderNet (denoted add-accumulate, aac) into a single computational stage. As a result, there are data dependencies between sac and aac that prevent them from being executed in parallel, resulting in 2× more operations compared to other multiplier-less CNNs such as DeepShift-PS and AdderNet. To overcome this performance bottleneck, we propose a novel technique that allows pipelined processing between sac and aac, effectively reducing the latency. The proposed ShiftAddNet-18 was evaluated on a small ResNet-18, achieving 11.37 ms of latency per image, which is ~69.21% faster than the original version that takes 19.24 ms. On a denser network, the proposed pipeline ShiftAddNet-101 requires only 61.92 ms compared to the original version's 98.85 ms, a latency reduction of ~37.1%. Compared to the state-of-the-art multiplier-less CNN core (e.g., AdderNet), our work is 20% slower in latency but provides higher accuracy and consumes 2.2× less DSP.
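For readers unfamiliar with the two primitives named in the abstract, the following is a minimal NumPy sketch of the shift-accumulate (sac) and add-accumulate (aac) operations. It is not code from the paper and is far removed from the fixed-point FPGA datapath; all names and values are illustrative.

```python
import numpy as np

def shift_accumulate(x, signs, powers):
    # DeepShift-PS-style "sac": each weight is sign * 2**power, so x * w
    # reduces to a sign flip plus a bit shift (a barrel shifter in hardware).
    acc = 0.0
    for xi, s, p in zip(x, signs, powers):
        acc += s * np.ldexp(xi, int(p))  # s * xi * 2**p, no multiplier needed
    return acc

def add_accumulate(x, w):
    # AdderNet-style "aac": similarity computed with additions only,
    # i.e. the negative L1 distance between the input window and the weights.
    return -np.sum(np.abs(x - w))

# One toy "output pixel": the aac stage consumes the sac stage's result,
# which is the data dependency the paper's pipelining is designed to hide.
x      = np.array([0.5, -1.25, 2.0, 0.75])
signs  = np.array([1, -1, 1, -1])
powers = np.array([-1, 0, 1, -2])   # shift-layer weights are +/- 2**power
w_add  = np.array([0.25])           # a (tiny) illustrative add-layer weight

sac_out = np.array([shift_accumulate(x, signs, powers)])
aac_out = add_accumulate(sac_out, w_add)
print(sac_out, aac_out)
```

Because aac can only begin once the sac result it depends on is available, the two stages cannot simply run in parallel; the paper's contribution is a pipelined arrangement that overlaps their execution to recover most of the lost throughput.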
Journal introduction:
The scope of the Journal comprises all aspects of the theory and design of analog and digital circuits together with the application of the ideas and techniques of circuit theory in other fields of science and engineering. Examples of the areas covered include: Fundamental Circuit Theory together with its mathematical and computational aspects; Circuit modeling of devices; Synthesis and design of filters and active circuits; Neural networks; Nonlinear and chaotic circuits; Signal processing and VLSI; Distributed, switched and digital circuits; Power electronics; Solid state devices. Contributions to CAD and simulation are welcome.