Xiaotian Zhao, Ruge Xu, Yimin Gao, Vaibhav Verma, Mircea R. Stan, Xinfei Guo
{"title":"Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing","authors":"Xiaotian Zhao;Ruge Xu;Yimin Gao;Vaibhav Verma;Mircea R. Stan;Xinfei Guo","doi":"10.1109/TC.2024.3441860","DOIUrl":null,"url":null,"abstract":"As one of the prevailing deep neural networks compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized with a 14nm technology, the design delivers a speedup of \n<inline-formula><tex-math>$15.50\\times$</tex-math></inline-formula>\n to \n<inline-formula><tex-math>$47.67\\times$</tex-math></inline-formula>\n over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS performance. This work also achieves an energy efficiency at 20.51 TOPS/W, which not only exceeds contemporary state-of-the-art MPQ hardware solutions at the edge, but also marks a significant advancement in the field. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristics search. Evaluation results show that this search algorithm achieves \n<inline-formula><tex-math>$2.2\\%\\sim 6.7\\%$</tex-math></inline-formula>\n higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore we expand the search space using a dynamic programming (DP) strategy to perform search with more fine-grained accuracy intervals and support multi-dimensional search. This further improves the inference accuracy by over \n<inline-formula><tex-math>$1.3\\%$</tex-math></inline-formula>\n compared to a greedy-based search.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2504-2519"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10633877/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized with a 14 nm technology, the design delivers a speedup of $15.50\times$ to $47.67\times$ over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS performance. This work also achieves an energy efficiency of 20.51 TOPS/W, which not only exceeds contemporary state-of-the-art MPQ hardware solutions at the edge but also marks a significant advancement in the field. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves $2.2\%\sim 6.7\%$ higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to perform the search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves the inference accuracy by over $1.3\%$ compared to a greedy-based search.
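The abstract does not detail the hierarchical multiplier, but the general recomposition idea is standard: build a wide multiply out of narrower sub-multipliers so the same units can also serve the lower precisions directly. The Python sketch below illustrates this for an unsigned 8x8 multiply assembled from 4x4 pieces; the splitting granularity, the unsigned-only handling, and the `mul4`/`mul8_hierarchical` names are illustrative assumptions, not the paper's design.

```python
def mul4(a, b):
    """Stand-in for a 4x4-bit hardware sub-multiplier."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b

def mul8_hierarchical(a, b):
    """8x8 unsigned multiply recomposed from four 4x4 partial products."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((mul4(a_hi, b_hi) << 8) +
            ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4) +
            mul4(a_lo, b_lo))

assert mul8_hierarchical(0xAB, 0xCD) == 0xAB * 0xCD
# In a low-precision mode the same sub-units could instead operate
# independently, e.g. four parallel 4-bit products instead of one 8-bit one.
```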
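Similarly, the abstract names the ingredients of the search (layer-wise sensitivity sampling, a heuristic search under hardware constraints) without giving the exact algorithm. What follows is a minimal sketch of one plausible greedy realization, assuming precomputed per-layer sensitivity scores and a per-layer cost table; `greedy_mpq_search`, the cost model, and every number here are hypothetical placeholders rather than the paper's metrics.

```python
BITWIDTHS = (2, 4, 8, 16)  # precisions the versatile inference units support

def greedy_mpq_search(sensitivities, layer_costs, budget):
    """Assign each layer a bit-width so the total cost fits the budget.

    sensitivities[i]  -- proxy for accuracy loss when layer i is quantized
    layer_costs[i][b] -- hardware cost (cycles, bytes, ...) of layer i at width b
    """
    n = len(sensitivities)
    config = [16] * n                                  # start at full precision
    total = sum(layer_costs[i][16] for i in range(n))
    order = sorted(range(n), key=lambda i: sensitivities[i])  # least sensitive first
    while total > budget:
        moved = False
        for i in order:
            idx = BITWIDTHS.index(config[i])
            if idx == 0:
                continue                               # already at minimum width
            nxt = BITWIDTHS[idx - 1]                   # one step down the ladder
            total -= layer_costs[i][config[i]] - layer_costs[i][nxt]
            config[i] = nxt
            moved = True
            if total <= budget:
                break
        if not moved:
            break                                      # budget is infeasible
    return config, total

# Toy usage: three layers, a linear cost model, made-up sensitivities.
costs = [{b: b * size for b in BITWIDTHS} for size in (1000, 4000, 2000)]
print(greedy_mpq_search([0.9, 0.1, 0.5], costs, budget=40000))
# -> ([8, 4, 8], 40000): the least sensitive layer drops furthest.
```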
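For the DP expansion, one natural reading is a knapsack-style dynamic program over discretized accuracy-loss intervals: for each total-loss bucket, track the cheapest hardware cost that reaches it, then return the smallest feasible loss within the budget. The sketch below assumes integer loss units and a scalar cost, both placeholders standing in for the paper's fine-grained intervals and multi-dimensional constraints.

```python
def dp_mpq_search(loss, cost, n_bins, budget):
    """loss[i][b] (integer interval units) and cost[i][b] give the estimated
    accuracy loss and hardware cost of layer i at bit-width option b.
    Returns the assignment with the smallest total loss that fits the budget.
    """
    INF = float("inf")
    n = len(loss)
    dp = [INF] * (n_bins + 1)                # dp[a] = min cost at total loss a
    dp[0] = 0.0
    choice = [[None] * (n_bins + 1) for _ in range(n)]
    for i in range(n):
        ndp = [INF] * (n_bins + 1)
        for a in range(n_bins + 1):
            if dp[a] == INF:
                continue
            for b in range(len(loss[i])):
                na = a + loss[i][b]
                if na <= n_bins and dp[a] + cost[i][b] < ndp[na]:
                    ndp[na] = dp[a] + cost[i][b]
                    choice[i][na] = (b, a)   # remember the move for backtracking
        dp = ndp
    for a in range(n_bins + 1):              # smallest feasible total loss first
        if dp[a] <= budget:
            cfg, cur = [None] * n, a
            for i in range(n - 1, -1, -1):   # recover per-layer choices
                b, cur = choice[i][cur]
                cfg[i] = b
            return cfg, a, dp[a]
    return None                              # no configuration fits the budget

# Toy usage: two layers, options ordered (INT4, INT8); loss in interval units.
print(dp_mpq_search(loss=[[3, 1], [2, 0]], cost=[[4, 8], [6, 12]],
                    n_bins=6, budget=16))
# -> ([1, 0], 3, 14): layer 0 stays at INT8, layer 1 drops to INT4.
```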
Journal Introduction
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.