SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC

Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi
{"title":"SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC","authors":"Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi","doi":"10.1109/NILES53778.2021.9600528","DOIUrl":null,"url":null,"abstract":"It is clearly known that deep learning applications are enormously used in the image classification, object tracking and related image analysis techniques. But deep learning networks usually involve huge number of parameters that need to be extensively processed to produce the classification output, which also takes a considerable time. GPUs are exploited to do such huge parallel computations to be finished within acceptable time. Still GPUs consume huge power, so they are not suitable for embedded solutions, and also they are very expensive. In the current work, complete implementation of floating point based SqueezeNet convolutional neural network (CNN) is done on ZYNQ System-On-Chip (SoC) XC7020 via partitioning the implementation on both the software part (ARM) and the FPGA part (Artix-7), the acceleration is done via parallel implementations of average pool layer on up to 3 channels with speedup = 6.37 for the Max Pool layer accelerated single channel and 13.88 for the Average Pool layer accelerated 3 channels in parallel. The maximum power consumption equals 1.549 watt (only 0.136 watt for the static power consumption) and the remaining is the dynamic power consumption which is greatly less than the GPU power consumption (reaches ~ 60 watt).","PeriodicalId":249153,"journal":{"name":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NILES53778.2021.9600528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Deep learning is widely used in image classification, object tracking, and related image analysis techniques. However, deep networks typically involve a huge number of parameters that must be processed extensively to produce a classification output, which takes considerable time. GPUs are exploited to finish such large parallel computations within an acceptable time, but they consume substantial power and are expensive, so they are not suitable for embedded solutions. In the current work, a complete implementation of the floating-point-based SqueezeNet convolutional neural network (CNN) is carried out on the ZYNQ XC7Z020 System-on-Chip (SoC) by partitioning the design between the software part (the ARM processor) and the FPGA part (the Artix-7 fabric). Acceleration is achieved through parallel implementations of the pooling layers on up to 3 channels, with a speedup of 6.37 for the max pool layer accelerated on a single channel and 13.88 for the average pool layer accelerated on 3 channels in parallel. The maximum power consumption equals 1.549 W, of which only 0.136 W is static power; the remainder is dynamic power, which is far below typical GPU power consumption (~60 W).
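The abstract gives no source listing, but the pooling parallelism it describes is straightforward to illustrate. The minimal C sketch below computes a global average pool over a CH x H x W feature map, the operation SqueezeNet applies before its classifier; the channel loop is the one a SW/HW-partitioned design would replicate in hardware (e.g., a 3-way unroll directive in an HLS flow, matching the 3-channel parallelism reported above). Function names and dimensions are hypothetical, not taken from the paper.

#include <stdio.h>

/* Sketch of SqueezeNet-style global average pooling over a CH x H x W
 * feature map stored in channel-major order, using IEEE-754 single
 * precision as in the paper.  In the SW/HW-partitioned design the
 * channel loop is the one replicated in hardware; in an HLS flow, a
 * directive such as "#pragma HLS UNROLL factor=3" on that loop would
 * create the 3 parallel channel datapaths the speedup figure refers to.
 */
static void avg_pool_global(const float *in, float *out,
                            int ch, int h, int w)
{
    for (int c = 0; c < ch; ++c) {   /* candidate for 3-way parallel unroll */
        float acc = 0.0f;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                acc += in[(c * h + y) * w + x];
        out[c] = acc / (float)(h * w);
    }
}

int main(void)
{
    /* Two 2x2 channels: the averages should be 2.5 and 6.5. */
    const float in[2 * 2 * 2] = { 1, 2, 3, 4,   5, 6, 7, 8 };
    float out[2];

    avg_pool_global(in, out, 2, 2, 2);
    printf("%.2f %.2f\n", out[0], out[1]);
    return 0;
}

Running the example averages two 2x2 channels and prints 2.50 and 6.50; in the accelerated design described above, the same arithmetic would be performed by replicated floating-point units in the Artix-7 fabric while the ARM core handles the remaining layers.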