Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands

Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis
{"title":"Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands","authors":"Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis","doi":"10.1109/ISQED57927.2023.10129341","DOIUrl":null,"url":null,"abstract":"Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, adding a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times in DeepLabV3+ outperforming the recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 of our designed accelerator is reduced by 60 percent.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":" 33","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 24th International Symposium on Quality Electronic Design (ISQED)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISQED57927.2023.10129341","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, which add a large number of zeros between the elements of the input features or weights. Standard neural network hardware accelerators usually process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in increased computation time. To cope with this problem, recent works either skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that introduces no hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This yields a speed-up of 5 times on DeepLabV3+, outperforming recently proposed design methods. Support for precision-scalable execution of all workloads further increases the speed-up in computation time, demonstrated for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 on our accelerator is reduced by 60 percent.
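The abstract describes the zero-insertion problem in prose; the minimal NumPy sketch below (ours, not the paper's instruction set or memory scheme) makes the underlying arithmetic concrete. It shows that a dilated 1-D convolution is numerically identical to a standard convolution with a zero-stuffed kernel, and counts the multiply-accumulate (MAC) operations a naive accelerator spends on the inserted zeros versus a zero-skipping access pattern. The function names (`conv1d_valid`, `dilate_kernel`) are illustrative, not from the paper.

```python
# A minimal sketch, assuming a 1-D "valid" convolution; illustrates why naive
# handling of dilation wastes compute, not how the paper's accelerator works.
import numpy as np

def conv1d_valid(x, w):
    """Plain 'valid' 1-D convolution (correlation form); returns output and MAC count."""
    n = len(x) - len(w) + 1
    out = np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])
    return out, n * len(w)

def dilate_kernel(w, rate):
    """Insert (rate - 1) zeros between kernel taps -- what a naive accelerator convolves with."""
    wd = np.zeros((len(w) - 1) * rate + 1)
    wd[::rate] = w
    return wd

x = np.arange(16, dtype=float)    # toy input feature row
w = np.array([1.0, -2.0, 1.0])    # 3-tap kernel
rate = 4                          # dilation rate

# Naive: convolve with the zero-stuffed kernel, spending MACs on every zero.
y_naive, macs_naive = conv1d_valid(x, dilate_kernel(w, rate))

# Zero-skipping: gather only the strided input taps, so no zero enters a MAC.
span = (len(w) - 1) * rate + 1    # receptive field of the dilated kernel
y_skip = np.array([np.dot(x[i:i + span:rate], w)
                   for i in range(len(x) - span + 1)])
macs_skip = len(y_skip) * len(w)

assert np.allclose(y_naive, y_skip)     # identical results
print(f"MACs: naive={macs_naive}, zero-skipping={macs_skip}")  # 72 vs 24
```

For a 3-tap kernel at dilation rate 4, the naive form performs 72 MACs against the zero-skipping form's 24, and the gap grows with the dilation rate; this is the inefficiency the paper's instruction set and memory scheme target without adding hardware.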