SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC

Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi
{"title":"SNAPE-FP: SqueezeNet CNN with Accelerated Pooling Layers Extension based on IEEE-754 Floating Point Implementation through SW/HW Partitioning On ZYNQ SoC","authors":"Abdelrhman M. Abotaleb, Mohab H. Ahmed, Mazen A. Fathi","doi":"10.1109/NILES53778.2021.9600528","DOIUrl":null,"url":null,"abstract":"It is clearly known that deep learning applications are enormously used in the image classification, object tracking and related image analysis techniques. But deep learning networks usually involve huge number of parameters that need to be extensively processed to produce the classification output, which also takes a considerable time. GPUs are exploited to do such huge parallel computations to be finished within acceptable time. Still GPUs consume huge power, so they are not suitable for embedded solutions, and also they are very expensive. In the current work, complete implementation of floating point based SqueezeNet convolutional neural network (CNN) is done on ZYNQ System-On-Chip (SoC) XC7020 via partitioning the implementation on both the software part (ARM) and the FPGA part (Artix-7), the acceleration is done via parallel implementations of average pool layer on up to 3 channels with speedup = 6.37 for the Max Pool layer accelerated single channel and 13.88 for the Average Pool layer accelerated 3 channels in parallel. The maximum power consumption equals 1.549 watt (only 0.136 watt for the static power consumption) and the remaining is the dynamic power consumption which is greatly less than the GPU power consumption (reaches ~ 60 watt).","PeriodicalId":249153,"journal":{"name":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NILES53778.2021.9600528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Deep learning is widely used in image classification, object tracking, and related image analysis techniques. However, deep networks typically involve a huge number of parameters that must be processed extensively to produce a classification output, which takes considerable time. GPUs are exploited to finish such large parallel computations within an acceptable time, but they consume substantial power and are expensive, so they are not suitable for embedded solutions. In the current work, a complete implementation of the floating-point-based SqueezeNet convolutional neural network (CNN) is carried out on the ZYNQ XC7Z020 System-on-Chip (SoC) by partitioning the design between the software part (the ARM processor) and the FPGA part (the Artix-7 fabric). Acceleration is achieved through parallel implementations of the pooling layers on up to 3 channels, with a speedup of 6.37 for the max pool layer accelerated on a single channel and 13.88 for the average pool layer accelerated on 3 channels in parallel. The maximum power consumption equals 1.549 W, of which only 0.136 W is static power; the remainder is dynamic power, which is far below typical GPU power consumption (~60 W).
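The abstract gives no source listing, but the pooling parallelism it describes is straightforward to illustrate. The minimal C sketch below computes a global average pool over a CH x H x W feature map, the operation SqueezeNet applies before its classifier; the channel loop is the one a SW/HW-partitioned design would replicate in hardware (e.g., a 3-way unroll directive in an HLS flow, matching the 3-channel parallelism reported above). Function names and dimensions are hypothetical, not taken from the paper.

#include <stdio.h>

/* Sketch of SqueezeNet-style global average pooling over a CH x H x W
 * feature map stored in channel-major order, using IEEE-754 single
 * precision as in the paper.  In the SW/HW-partitioned design the
 * channel loop is the one replicated in hardware; in an HLS flow, a
 * directive such as "#pragma HLS UNROLL factor=3" on that loop would
 * create the 3 parallel channel datapaths the speedup figure refers to.
 */
static void avg_pool_global(const float *in, float *out,
                            int ch, int h, int w)
{
    for (int c = 0; c < ch; ++c) {   /* candidate for 3-way parallel unroll */
        float acc = 0.0f;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                acc += in[(c * h + y) * w + x];
        out[c] = acc / (float)(h * w);
    }
}

int main(void)
{
    /* Two 2x2 channels: the averages should be 2.5 and 6.5. */
    const float in[2 * 2 * 2] = { 1, 2, 3, 4,   5, 6, 7, 8 };
    float out[2];

    avg_pool_global(in, out, 2, 2, 2);
    printf("%.2f %.2f\n", out[0], out[1]);
    return 0;
}

Running the example averages two 2x2 channels and prints 2.50 and 6.50; in the accelerated design described above, the same arithmetic would be performed by replicated floating-point units in the Artix-7 fabric while the ARM core handles the remaining layers.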