Task-Adapted Learnable Embedded Quantization for Scalable Human-Machine Image Compression

Authors: Shaohui Li; Shuoyu Ma; Wenrui Dai; Nuowen Kan; Fan Cheng; Chenglin Li; Junni Zou; Hongkai Xiong
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4768-4783
DOI: 10.1109/TCSVT.2025.3525664
Publication date: 2025-01-03
Journal impact factor: 8.3 (JCR Q1, Engineering, Electrical & Electronic)
URL: https://ieeexplore.ieee.org/document/10824850/
Abstract: Image compression for both human and machine vision has become prevalent to accommodate the rising demand for machine-to-machine and human-to-machine communications. Scalable human-machine image compression has recently emerged as an efficient alternative that simultaneously achieves high accuracy for machine vision in the base layer and high-fidelity reconstruction for human vision in the enhancement layer. However, existing methods achieve scalable coding with heuristic mechanisms that cannot fully exploit inter-layer correlations and noticeably sacrifice rate-distortion performance. In this paper, we propose task-adapted learnable embedded quantization to address this problem in an analytically optimized fashion. We first reveal the relationship between the latent representations for machine and human vision, and demonstrate that the optimal representation for machine vision can be approximated by post-training optimization of the learned representation for human vision. On this basis, we propose task-adapted learnable embedded quantization, which leverages a learnable step predictor to adaptively determine the optimal quantization step for diverse machine vision tasks, so that the inter-layer correlations between the representations for human and machine vision are sufficiently exploited through embedded quantization. Furthermore, we develop a human-machine scalable coding framework by incorporating the proposed embedded quantization into pre-trained learned image compression models. Experimental results demonstrate that the proposed framework achieves state-of-the-art performance on machine vision tasks such as object detection, instance segmentation, and panoptic segmentation, with negligible loss in rate-distortion performance for human vision.
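For readers unfamiliar with embedded (nested) quantization, the base/enhancement split described in the abstract can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's coder: the scalar rounding, the step value, and the `refine_factor` are assumptions made here for illustration, and the paper's contribution is precisely that the base-layer step is predicted per machine-vision task by a learnable module rather than fixed by hand.

```python
def embedded_quantize(x, base_step, refine_factor=4):
    """Quantize scalar x into a coarse base-layer index plus an
    enhancement-layer index for the residual on a nested finer grid."""
    fine_step = base_step / refine_factor
    base_idx = round(x / base_step)          # base layer (machine vision)
    residual = x - base_idx * base_step      # what the base layer misses
    enh_idx = round(residual / fine_step)    # enhancement layer (human vision)
    return base_idx, enh_idx

def reconstruct(base_idx, enh_idx, base_step, refine_factor=4):
    """Return the base-layer and full (base + enhancement) reconstructions."""
    fine_step = base_step / refine_factor
    base_rec = base_idx * base_step
    full_rec = base_rec + enh_idx * fine_step
    return base_rec, full_rec

# The enhancement layer only refines the base-layer error, so the two
# layers share information instead of coding the latent twice.
base_idx, enh_idx = embedded_quantize(0.7, base_step=1.0)
base_rec, full_rec = reconstruct(base_idx, enh_idx, base_step=1.0)
assert abs(0.7 - base_rec) <= 0.5 + 1e-9      # coarse-grid error bound
assert abs(0.7 - full_rec) <= 0.125 + 1e-9    # nested fine-grid error bound
```

Because the fine grid is nested inside the coarse one, the enhancement bitstream is a strict refinement of the base bitstream; this is what lets a scalable codec serve machine vision from the base layer alone while the full stream recovers a high-fidelity image.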
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.