VLCQ: Post-training quantization for deep neural networks using variable length coding

IF 6.2 · CAS Tier 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS
Reem Abdel-Salam, Ahmed H. Abdel-Gawad, Amr G. Wassal
{"title":"VLCQ: Post-training quantization for deep neural networks using variable length coding","authors":"Reem Abdel-Salam,&nbsp;Ahmed H. Abdel-Gawad,&nbsp;Amr G. Wassal","doi":"10.1016/j.future.2024.107654","DOIUrl":null,"url":null,"abstract":"<div><div>Quantization plays a crucial role in efficiently deploying deep learning models on resources constraint devices. Post-training quantization does not require either access to the original dataset or retraining the full model. Current methods that achieve high performance (near baseline results) require INT8 fixed-point integers. However, to achieve high model compression by achieving lower bit-width, significant degradation to the performance becomes the challenge. In this paper, we propose VLCQ, which relaxes the constraint of fixed-point encoding which limits the quantization techniques from better quantizing the weights. Therefore, this work utilizes variable-length encoding which allows for exploring the whole space of quantization techniques. Thus, achieving much better results (close to or even better than the baseline results) while achieving lower bit-widths without the need to access any training data or to fine-tune the model. Extensive experiments were carried out on various deep-learning models for the image classification and segmentation, and object detection tasks. When compared to state-of-the-art post-training quantization approaches, experimental results reveal that our suggested method offers improved performance with better model compression (lower bit-rate). For per-channel quantization, our method surpassed the FP32 accuracy and Piece-Wise Linear Quantization (PWLQ) method in most models while achieving up-to 6X model compression ratio compared to the FP32 and up-to 1.7X compared to PWLQ. If the model compression is the concern with little effect on performance, our method achieves up-to 12.25X compression ratio compared to FP32 within 4% performance loss. For per-tensor, our method is competitive with Data-Free Quantization scheme (DFQ) in achieving the best performance. However, our method is more flexible in getting lower bit rates than DFQ across the different tasks and models.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"166 ","pages":"Article 107654"},"PeriodicalIF":6.2000,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X24006186","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Quantization plays a crucial role in efficiently deploying deep learning models on resource-constrained devices. Post-training quantization requires neither access to the original dataset nor retraining of the full model. Current methods that achieve high performance (near-baseline results) rely on INT8 fixed-point integers; however, pushing to lower bit-widths for higher model compression causes significant performance degradation. In this paper, we propose VLCQ, which relaxes the fixed-point encoding constraint that prevents quantization techniques from quantizing the weights more effectively. Instead, this work uses variable-length encoding, which allows the whole space of quantization techniques to be explored and achieves much better results (close to, or even better than, the baseline) at lower bit-widths, without access to any training data and without fine-tuning the model. Extensive experiments were carried out on various deep-learning models for image classification, segmentation, and object detection tasks. Compared with state-of-the-art post-training quantization approaches, the experimental results show that the proposed method offers improved performance with better model compression (lower bit-rate). For per-channel quantization, our method surpasses the FP32 accuracy and the Piece-Wise Linear Quantization (PWLQ) method on most models, while achieving up to a 6X compression ratio relative to FP32 and up to 1.7X relative to PWLQ. If model compression is the main concern and a small performance drop is acceptable, our method achieves up to a 12.25X compression ratio relative to FP32 within a 4% performance loss. For per-tensor quantization, our method is competitive with the Data-Free Quantization (DFQ) scheme in terms of peak performance, while being more flexible than DFQ in reaching lower bit-rates across the different tasks and models.
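To make the compression arithmetic above concrete, the following sketch illustrates one way per-channel quantization can be paired with a variable-length code: a layer's weights are quantized to a uniform per-channel grid and the resulting integer indices are Huffman-coded, so the average bits per weight (and hence the compression ratio relative to 32-bit floats) follows the weight statistics rather than a fixed INT8 format. This is a minimal illustration under assumed choices (a 16-level uniform grid, plain Huffman coding, and hypothetical helper names); it is not the VLCQ method described in the paper.

```python
# Minimal, illustrative sketch only -- NOT the authors' VLCQ algorithm.
# Idea shown: per-channel quantization followed by variable-length (Huffman)
# coding of the integer indices, so the average bit-width is set by the
# weight statistics rather than by a fixed INT8 grid. The 16-level grid and
# all helper names are assumptions made for this example.
import heapq
from collections import Counter

import numpy as np


def quantize_per_channel(weights: np.ndarray, n_levels: int = 16):
    """Uniform per-channel (per-row) quantization of a 2-D weight matrix.

    Returns integer indices plus the per-channel (scale, min) needed to
    dequantize approximately back to floating point.
    """
    mins = weights.min(axis=1, keepdims=True)
    maxs = weights.max(axis=1, keepdims=True)
    scales = np.maximum((maxs - mins) / (n_levels - 1), 1e-12)
    indices = np.round((weights - mins) / scales).astype(np.int64)
    return indices, scales, mins


def huffman_code_lengths(symbols: np.ndarray) -> dict:
    """Code length in bits per symbol for a plain Huffman code."""
    freq = Counter(symbols.ravel().tolist())
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: bits + 1 for s, bits in {**c1, **c2}.items()}
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 256)).astype(np.float32)  # stand-in for one layer
    idx, scales, mins = quantize_per_channel(w, n_levels=16)
    lengths = huffman_code_lengths(idx)
    counts = Counter(idx.ravel().tolist())
    avg_bits = sum(counts[s] * lengths[s] for s in counts) / idx.size
    print(f"average bits per weight: {avg_bits:.2f} (32 for FP32, 8 for INT8)")
    # Reconstruction error of the quantized weights (sanity check).
    w_hat = idx * scales + mins
    print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```

On Gaussian-like weights, a coder of this kind typically averages well under 8 bits per weight at 16 levels; the paper's reported ratios against FP32 and PWLQ quantify this kind of trade-off for its own encoding scheme.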
Source journal: Future Generation Computer Systems
CiteScore: 19.90
Self-citation rate: 2.70%
Articles per year: 376
Review time: 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.