Exploring Efficient Hardware Accelerator for Learning-Based Image Compression

IF 2.9 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems · Pub Date: 2024-12-11 · DOI: 10.1109/TCAD.2024.3515856

Chen Chen;Haoyang Zhang;Kaicheng Guo;Xingzi Yu;Weidong Qiu;Zhengwei Qi;Haibing Guan

Volume 44, Issue 6, pages 2204-2217 · https://ieeexplore.ieee.org/document/10793077/

Citations: 0

Abstract

Recently, learning-based image compression (LIC) methods have surpassed manually designed approaches in both compression quality and bitrate. However, increasing computational demands and insufficient optimization of codec performance have hindered the advancement of LIC acceleration. Most research focuses on optimizing specific components and often neglects the sources of underutilization during the execution of LIC models. Generally, efficient LIC acceleration encounters three primary challenges: 1) extra overheads introduced by individual optimizations; 2) load and computation imbalances in small kernels; and 3) mismatches between hardware configurations and the LIC models. To address these challenges, we propose a framework named extensive accelerator for LIC (X-LIC) for efficiently exploring the design space under constrained resources. First, we quantitatively characterize a representative LIC model, including its latency, computation size, and temporal utilization across various accelerators. We design a hardware-optimized quantization method to compensate for the lack of LIC-oriented research, particularly regarding data precision, distortion, and resource consumption. Additionally, we propose a parameterized LIC accelerator architecture that integrates seamlessly with existing loop optimization models and supports various LIC operators. We propose two optimization schemes: one targets redundant computation in transposed convolution, and the other targets load and computation imbalance in small kernels. Experimental results show that our framework demonstrates significant flexibility across a broad design space, achieving on average 78%–95% of the theoretical peak performance and up to 688.2 GOP/s (encoder) and 759.1 GOP/s (decoder) at INT8 precision. As a result, the encoder and decoder reach up to 33 and 36 FPS, respectively, at 720p resolution. An FPGA demo of X-LIC is available at https://github.com/sjtu-tcloud/X-LIC.
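To illustrate why transposed convolution can involve redundant computation, here is a minimal 1-D sketch in plain Python. It is not the scheme used in X-LIC, which the abstract does not describe. It only compares a common baseline (insert zeros between input samples, then run a dense convolution) against a variant that skips the multiplications known to hit an inserted zero. The function names and the toy input are hypothetical and chosen for illustration only.

```python
def transposed_conv1d_zero_insert(x, w, stride):
    """Baseline: zero-insert the input, then run a dense convolution.

    Returns (output, number of multiply-accumulates performed).
    """
    n, k = len(x), len(w)
    # Upsample: place each input sample on a grid with spacing `stride`.
    up = [0.0] * ((n - 1) * stride + 1)
    for i, v in enumerate(x):
        up[i * stride] = v
    # Full padding so every output position sees k taps.
    padded = [0.0] * (k - 1) + up + [0.0] * (k - 1)
    flipped = w[::-1]
    out_len = (n - 1) * stride + k
    out, macs = [], 0
    for j in range(out_len):
        acc = 0.0
        for t in range(k):
            acc += padded[j + t] * flipped[t]  # many of these multiply by zero
            macs += 1
        out.append(acc)
    return out, macs


def transposed_conv1d_skip_zeros(x, w, stride):
    """Same result, but only visits kernel taps that meet a real input sample."""
    n, k = len(x), len(w)
    out_len = (n - 1) * stride + k
    out, macs = [], 0
    for j in range(out_len):
        acc = 0.0
        # Tap t contributes to output j only if (j - t) lies on the stride grid.
        for t in range(j % stride, k, stride):
            i = (j - t) // stride
            if 0 <= i < n:
                acc += x[i] * w[t]
                macs += 1
        out.append(acc)
    return out, macs


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]   # toy input (assumption, not from the paper)
    w = [1.0, 0.5, 0.25]       # toy kernel (assumption, not from the paper)
    dense_out, dense_macs = transposed_conv1d_zero_insert(x, w, stride=2)
    sparse_out, sparse_macs = transposed_conv1d_skip_zeros(x, w, stride=2)
    print(dense_out == sparse_out, dense_macs, sparse_macs)
```

For this toy case the dense baseline performs 27 multiply-accumulates and the zero-skipping variant performs 12, with identical outputs. For long inputs the ratio approaches the stride in 1-D (and the square of the stride in 2-D), which is the kind of overhead an accelerator would want to avoid.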

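Source journal

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 5.60
Self-citation rate: 13.80%
Articles published: 500
Review time: 7 months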

About the journal: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.