Downscaling and Overflow-aware Model Compression for Efficient Vision Processors

Haokun Li, Jing Liu, Liancheng Jia, Yun Liang, Yaowei Wang, Mingkui Tan
{"title":"Downscaling and Overflow-aware Model Compression for Efficient Vision Processors","authors":"Haokun Li, Jing Liu, Liancheng Jia, Yun Liang, Yaowei Wang, Mingkui Tan","doi":"10.1109/ICDCSW56584.2022.00036","DOIUrl":null,"url":null,"abstract":"Network pruning and quantization are two effective ways for model compression. However, existing model compression methods seldom take hardware into consideration, resulting in compressed models that still take high energy and chip area cost on a vision processor. To address this issue, one may reduce the bit-widths of the accumulator and the multiplier in fixed-point inference to significantly reduce the energy and chip area. However, the numerical error brought from the low-bit multiplier in the downscaling procedure is large, while the low-bit accumulator suffers from the overflow issue. Both of them lead to significant performance degradation. In this paper, we propose downscaling and overflow-aware model compression for efficient vision processors. Specifically, we propose downscaling-aware training to simulate the downscaling procedure during training so that the models are adjusted to inference with low bit-width multipliers. To address the overflow issue, we apply overflow-aware training to gradually reduce the range of quantized values. We further restrict the channel's number of each layer to be the multiple of some value (e.g., 16) to take advantage of parallel computing by channel pruning. With the proposed method, we are able to obtain the compressed model with low bit-width accumulators and multipliers during inference while maintaining the performance. As a result, the energy and chip area cost can be significantly reduced. To demonstrate this, we further co-design an agilely customizable vision processor and its SoC. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our proposed method. For example, on ImageNet, our compressed 8-bit ResNet-50 achieves lossless performance with 16-bit accumulators and 12-bit multipliers.","PeriodicalId":357138,"journal":{"name":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 42nd International Conference on Distributed Computing Systems Workshops (ICDCSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCSW56584.2022.00036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Network pruning and quantization are two effective approaches to model compression. However, existing model compression methods seldom take the hardware into consideration, so the compressed models still incur high energy and chip-area costs on a vision processor. To address this issue, one may reduce the bit-widths of the accumulator and the multiplier used in fixed-point inference, which significantly reduces energy and chip area. However, the numerical error introduced by the low-bit multiplier in the downscaling procedure is large, while the low-bit accumulator suffers from overflow. Both lead to significant performance degradation. In this paper, we propose downscaling- and overflow-aware model compression for efficient vision processors. Specifically, we propose downscaling-aware training, which simulates the downscaling procedure during training so that the models adapt to inference with low bit-width multipliers. To address the overflow issue, we apply overflow-aware training to gradually reduce the range of quantized values. We further use channel pruning to restrict the number of channels in each layer to a multiple of some value (e.g., 16) so as to take advantage of parallel computing. With the proposed method, we obtain compressed models that use low bit-width accumulators and multipliers during inference while maintaining accuracy. As a result, the energy and chip-area costs can be significantly reduced. To demonstrate this, we further co-design an agilely customizable vision processor and its SoC. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the effectiveness of the proposed method. For example, on ImageNet, our compressed 8-bit ResNet-50 achieves lossless performance with 16-bit accumulators and 12-bit multipliers.
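For context, the sketch below illustrates the fixed-point downscaling step the abstract refers to, using NumPy: the integer accumulator is rescaled to the output quantization scale by a multiplier represented as a low-bit integer m and a right shift, and a low-bit accumulator is emulated by clamping before the rescale (which is where overflow would occur). This is only a minimal illustration under assumed per-tensor scales; the function and parameter names (quantize_multiplier, downscale, mult_bits, acc_bits) are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def quantize_multiplier(real_mult, mult_bits=12):
    # Represent a real-valued rescale factor (s_in * s_w / s_out) as
    # m * 2^(-shift), where m is a mult_bits-bit integer.
    # Assumes 0 < real_mult < 1, which holds for typical layer scales.
    shift = 0
    while real_mult < 0.5:
        real_mult *= 2.0
        shift += 1
    m = int(round(real_mult * (1 << (mult_bits - 1))))
    return m, shift + (mult_bits - 1)

def downscale(acc, m, shift, acc_bits=16, out_bits=8):
    # Clamp the accumulator to acc_bits to mimic a low-bit accumulator
    # (this is where overflow arises), then rescale with the low-bit
    # multiplier and round/clip to an out_bits output.
    acc_max = (1 << (acc_bits - 1)) - 1
    acc = np.clip(acc, -acc_max - 1, acc_max).astype(np.int64)
    out = np.round(acc * m / (1 << shift))
    out_max = (1 << (out_bits - 1)) - 1
    return np.clip(out, -out_max - 1, out_max).astype(np.int64)

# Toy usage with made-up per-tensor scales.
s_in, s_w, s_out = 0.02, 0.01, 0.05
m, shift = quantize_multiplier(s_in * s_w / s_out, mult_bits=12)
acc = np.random.randint(-20000, 20000, size=8)  # fake integer accumulator values
print(downscale(acc, m, shift, acc_bits=16, out_bits=8))
```

The approximation error of the low-bit multiplier (here 12 bits) and the clamping of the 16-bit accumulator are the two error sources the paper's downscaling-aware and overflow-aware training are designed to absorb.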