SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC
Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi
{"title":"SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC","authors":"Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi","doi":"10.1109/NILES53778.2021.9600528","DOIUrl":null,"url":null,"abstract":"It is clearly known that deep learning applications are enormously used in the image classification, object tracking and related image analysis techniques. But deep learning networks usually involve huge number of parameters that need to be extensively processed to produce the classification output, which also takes a considerable time. GPUs are exploited to do such huge parallel computations to be finished within acceptable time. Still GPUs consume huge power, so they are not suitable for embedded solutions, and also they are very expensive. In the current work, complete implementation of floating point based SqueezeNet convolutional neural network (CNN) is done on ZYNQ System-On-Chip (SoC) XC7020 via partitioning the implementation on both the software part (ARM) and the FPGA part (Artix-7), the acceleration is done via parallel implementations of average pool layer on up to 3 channels with speedup = 6.37 for the Max Pool layer accelerated single channel and 13.88 for the Average Pool layer accelerated 3 channels in parallel. 
The maximum power consumption equals 1.549 watt (only 0.136 watt for the static power consumption) and the remaining is the dynamic power consumption which is greatly less than the GPU power consumption (reaches ~ 60 watt).","PeriodicalId":249153,"journal":{"name":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NILES53778.2021.9600528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Deep learning is widely used in image classification, object tracking, and related image analysis tasks. Deep networks, however, involve an enormous number of parameters that must be processed to produce a classification output, which takes considerable time. GPUs are exploited to finish these massively parallel computations within acceptable time, but they consume substantial power, which makes them unsuitable for embedded solutions, and they are expensive. In this work, a complete floating-point implementation of the SqueezeNet convolutional neural network (CNN) is realized on a ZYNQ System-on-Chip (SoC) XC7Z020 by partitioning the design between the software part (the ARM processor) and the FPGA part (Artix-7 fabric). Acceleration is achieved through parallel implementation of the pooling layers on up to 3 channels, yielding a speedup of 6.37 for the Max Pool layer accelerated on a single channel and 13.88 for the Average Pool layer accelerated on 3 channels in parallel. The maximum power consumption is 1.549 W, of which only 0.136 W is static and the remainder dynamic, far below typical GPU power consumption (around 60 W).