Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference

IF 1.5 · CAS Tier 3 (Computer Science) · JCR Q4, Computer Science, Hardware & Architecture
Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang
{"title":"Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference","authors":"Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang","doi":"10.1145/3617688","DOIUrl":null,"url":null,"abstract":"Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires hundreds of megabytes of memory footprints, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNNs compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) low model compression ratio for the lightweight DNN structures with little redundancy; (ii) potential degradation in model inference accuracy; (iii) inadequate memory compression ratio is attributable to ignoring the layering property of DNN inference. To address these issues, we propose a lightweight memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory costs of DNN inference without degrading the model quality. Specifically, ① Smart-DNN+ applies a layer-wise binary-quantizer with a remapping mechanism to greatly reduce the model size by quantizing the typical floating-point DNN weights of 32-bit to the 1-bit signs layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket-encoder to keep the compressed quantization error by encoding the multiple similar floating-point residuals into the same integer bucket IDs. When running the compressed DNN in the user’s device, ③ Smart-DNN+ utilizes a partially decompressing strategy to greatly reduce the required memory overhead by first loading the compressed DNNs in memory and then dynamically decompressing the required materials for model inference layer by layer. Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves lower 0.17 \\(\\% \\) -0.92 \\(\\% \\) memory costs at lower runtime overheads compared with the state of the arts without degrading the inference accuracy. Moreover, Smart-DNN+ potentially reduces the inference runtime up to 2.04 × that of conventional DNN inference workflow.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3617688","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires a memory footprint of hundreds of megabytes, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNN compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) a low model compression ratio for lightweight DNN structures with little redundancy; (ii) potential degradation of model inference accuracy; (iii) an inadequate memory compression ratio attributable to ignoring the layer-wise nature of DNN inference. To address these issues, we propose a lightweight, memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory costs of DNN inference without degrading model quality. Specifically, ① Smart-DNN+ applies a layer-wise binary quantizer with a remapping mechanism that greatly reduces the model size by quantizing the typical 32-bit floating-point DNN weights to 1-bit signs, layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket encoder that retains the quantization error in compressed form by encoding multiple similar floating-point residuals into the same integer bucket ID. When running the compressed DNN on the user's device, ③ Smart-DNN+ utilizes a partial-decompression strategy that greatly reduces the required memory overhead by first loading the compressed DNN into memory and then dynamically decompressing the materials required for model inference layer by layer. Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves 0.17%-0.92% lower memory costs at lower runtime overheads than the state of the art, without degrading inference accuracy. Moreover, Smart-DNN+ reduces the inference runtime by up to 2.04× compared with the conventional DNN inference workflow.
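To make the abstract's steps ① and ② concrete, the following is a minimal sketch of the compression side: layer-wise 1-bit sign quantization followed by bucket encoding of the floating-point residuals. The function names, the per-layer mean-magnitude scale (standing in here for the paper's remapping mechanism), and the bucket width are illustrative assumptions, not the authors' exact design.

```python
# Sketch of the compression path: (1) quantize 32-bit float weights to
# 1-bit signs layer by layer, (2) bucket-encode the residuals so that
# similar floating-point errors share the same small integer ID.
import numpy as np

def quantize_layer(weights: np.ndarray, bucket_width: float = 0.01):
    """Compress one layer: packed 1-bit signs + integer bucket IDs for residuals."""
    w = weights.astype(np.float32).ravel()

    # 1-bit sign plane, packed 8 signs per byte.
    signs = (w >= 0).astype(np.uint8)
    packed_signs = np.packbits(signs)

    # Map signs back to floats with one per-layer scale (mean magnitude);
    # the residual is whatever the signs alone cannot express.
    scale = float(np.abs(w).mean())
    recon = np.where(signs == 1, scale, -scale).astype(np.float32)
    residual = w - recon

    # Bucket-encode residuals: nearby residuals collapse onto the same
    # small integer ID, which stores far more compactly than raw floats.
    bucket_ids = np.round(residual / bucket_width).astype(np.int16)

    return {
        "shape": weights.shape,
        "scale": scale,
        "bucket_width": bucket_width,
        "packed_signs": packed_signs,
        "bucket_ids": bucket_ids,
    }

def compress_model(state_dict: dict) -> dict:
    """Apply the layer-wise quantizer to every weight tensor in a model."""
    return {name: quantize_layer(w) for name, w in state_dict.items()}
```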
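Step ③ can be sketched in the same hedged spirit: the compressed model stays resident in memory, and each layer's weights are reconstructed only when that layer is about to execute, then released. The `decompress_layer` helper inverts the sketch above, and the `layer_forward` callback is a hypothetical placeholder for the actual layer computation in a real runtime.

```python
# Sketch of the partial-decompression inference loop: keep the compressed
# model in memory and materialize at most one decompressed layer at a time.
import numpy as np

def decompress_layer(entry: dict) -> np.ndarray:
    """Invert quantize_layer(): signs * scale plus the bucketed residual."""
    n = int(np.prod(entry["shape"]))
    signs = np.unpackbits(entry["packed_signs"])[:n]
    recon = np.where(signs == 1, entry["scale"], -entry["scale"]).astype(np.float32)
    recon += entry["bucket_ids"].astype(np.float32) * entry["bucket_width"]
    return recon.reshape(entry["shape"])

def run_inference(compressed_model: dict, x: np.ndarray, layer_forward) -> np.ndarray:
    """Run a sequential model while keeping only one decompressed layer alive."""
    for name, entry in compressed_model.items():
        weights = decompress_layer(entry)    # materialize just this layer
        x = layer_forward(name, weights, x)  # user-supplied layer computation
        del weights                          # drop it before the next layer
    return x
```

Under this scheme the peak working set is roughly the compressed model plus a single decompressed layer, which mirrors the abstract's argument for why the memory savings hold even for lightweight architectures with little structural redundancy.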
Source Journal
ACM Transactions on Architecture and Code Optimization (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 3.60
Self-citation rate: 6.20%
Articles published: 78
Review time: 6-12 weeks
Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.