{"title":"利用二值化和少量全精度权值的神经网络压缩","authors":"Franco Maria Nardini , Cosimo Rulli , Salvatore Trani , Rossano Venturini","doi":"10.1016/j.ins.2025.122251","DOIUrl":null,"url":null,"abstract":"<div><div>Quantization and pruning are two effective Deep Neural Network model compression methods. In this paper, we propose <em>Automatic Prune Binarization</em> (<span>APB</span>), a novel compression technique combining quantization with pruning. <span>APB</span> enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network while minimizing its memory impact by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed using <span>APB</span> by decomposing it into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9× and 1.5× faster than available state-of-the-art solutions. We extensively evaluate <span>APB</span> on two widely adopted model compression datasets, namely CIFAR-10 and ImageNet. <span>APB</span> shows to deliver better accuracy/memory trade-off compared to state-of-the-art methods based on i) quantization, ii) pruning, and iii) a combination of pruning and quantization. <span>APB</span> also outperforms quantization in the accuracy/efficiency trade-off, being up to 2× faster than the 2-bits quantized model with no loss in accuracy.</div></div>","PeriodicalId":51063,"journal":{"name":"Information Sciences","volume":"716 ","pages":"Article 122251"},"PeriodicalIF":8.1000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Neural network compression using binarization and few full-precision weights\",\"authors\":\"Franco Maria Nardini , Cosimo Rulli , Salvatore Trani , Rossano Venturini\",\"doi\":\"10.1016/j.ins.2025.122251\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Quantization and pruning are two effective Deep Neural Network model compression methods. In this paper, we propose <em>Automatic Prune Binarization</em> (<span>APB</span>), a novel compression technique combining quantization with pruning. <span>APB</span> enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network while minimizing its memory impact by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed using <span>APB</span> by decomposing it into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9× and 1.5× faster than available state-of-the-art solutions. We extensively evaluate <span>APB</span> on two widely adopted model compression datasets, namely CIFAR-10 and ImageNet. <span>APB</span> shows to deliver better accuracy/memory trade-off compared to state-of-the-art methods based on i) quantization, ii) pruning, and iii) a combination of pruning and quantization. 
<span>APB</span> also outperforms quantization in the accuracy/efficiency trade-off, being up to 2× faster than the 2-bits quantized model with no loss in accuracy.</div></div>\",\"PeriodicalId\":51063,\"journal\":{\"name\":\"Information Sciences\",\"volume\":\"716 \",\"pages\":\"Article 122251\"},\"PeriodicalIF\":8.1000,\"publicationDate\":\"2025-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Sciences\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0020025525003834\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"0\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Sciences","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0020025525003834","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Neural network compression using binarization and few full-precision weights
Quantization and pruning are two effective methods for compressing deep neural network models. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique that combines quantization with pruning. APB enhances the representational capability of binary networks with a few full-precision weights: by deciding whether each weight should be binarized or kept in full precision, it jointly maximizes the accuracy of the network while minimizing its memory footprint. We show how to efficiently perform a forward pass through APB-compressed layers by decomposing the computation into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel, efficient algorithms for extremely quantized matrix multiplication on CPU that leverage highly efficient bitwise operations; they are 6.9× and 1.5× faster than available state-of-the-art solutions. We extensively evaluate APB on two widely adopted model compression datasets, namely CIFAR-10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) combinations of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, running up to 2× faster than a 2-bit quantized model with no loss in accuracy.
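To make the binarize-or-keep idea concrete, here is a minimal NumPy sketch (not the authors' implementation) of an APB-style decomposition: each weight either joins a scaled binary matrix or stays in full precision inside a sparse residual, so the forward pass splits into a binary and a sparse-dense matrix multiplication, as the abstract describes. The function names and the keep_ratio heuristic (keeping the weights worst approximated by the binary part) are illustrative assumptions; the paper decides the split jointly with training.

import numpy as np

def apb_decompose(W, keep_ratio=0.01):
    # Split W into a scaled binary matrix B and a sparse full-precision
    # residual S. keep_ratio is an illustrative stand-in for the paper's
    # learned binarize-or-keep decision.
    alpha = np.abs(W).mean()             # scale of the binary component
    B = np.where(W >= 0, 1.0, -1.0)      # sign binarization
    err = np.abs(W - alpha * B)          # per-weight approximation error
    k = max(1, int(keep_ratio * W.size))
    worst = np.argpartition(err.ravel(), -k)[-k:]
    S = np.zeros_like(W)
    S.flat[worst] = W.flat[worst]        # keep these in full precision
    B.flat[worst] = 0.0                  # and drop them from the binary part
    return alpha, B, S

def apb_forward(x, alpha, B, S):
    # Forward pass decomposed into a binary matmul plus a sparse-dense
    # matmul. S is stored densely here for simplicity; a real kernel
    # would use a sparse format so this term costs O(nnz) only.
    return alpha * (B @ x) + S @ x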
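The binary half of that product is where the bitwise speedups come from: a dot product between {-1, +1} vectors reduces to XOR (or XNOR) plus popcount. Below is a hedged NumPy emulation of that kernel; the paper's optimized CPU algorithms operate on machine words with hardware popcount and are considerably more involved.

import numpy as np

def binary_dot(a, b):
    # Dot product of two {-1, +1} vectors via bitwise operations:
    # pack signs into bits, XOR to count sign mismatches, then use
    # dot = matches - mismatches = n - 2 * mismatches.
    n = a.size
    ap = np.packbits((a > 0).astype(np.uint8))  # +1 -> bit 1, -1 -> bit 0
    bp = np.packbits((b > 0).astype(np.uint8))
    mismatches = int(np.unpackbits(ap ^ bp, count=n).sum())
    return n - 2 * mismatches

# Sanity check against an ordinary dot product.
rng = np.random.default_rng(0)
v = rng.choice([-1.0, 1.0], size=100)
w = rng.choice([-1.0, 1.0], size=100)
assert binary_dot(v, w) == int(v @ w)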
Journal introduction:
Information Sciences (Informatics and Computer Science, Intelligent Systems, Applications) is an international journal that publishes original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and survey contributions.
Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.