Multi-Domain Spatial-Temporal Redundancy Mining for Efficient Learned Video Compression
Feng Yuan; Zhaoqing Pan; Jianjun Lei; Bo Peng; Fu Lee Wang; Sam Kwong
IEEE Transactions on Broadcasting, vol. 71, no. 3, pp. 808-820, published 2025-07-23
DOI: 10.1109/TBC.2025.3587532
URL: https://ieeexplore.ieee.org/document/11090160/
Citations: 0
Abstract
Conditional Coding-based Learned Video Compression (CC-LVC) has become an important paradigm in learned video compression because it can effectively exploit spatial-temporal redundancies within a large context space. However, existing CC-LVC methods cannot accurately model motion information or efficiently mine contextual correlations in complex regions with non-rigid motions and non-linear deformations. To address these problems, this paper proposes an efficient CC-LVC method that mines spatial-temporal dependencies across multiple motion domains and receptive domains to improve video coding efficiency. To accurately model complex motions and generate precise temporal contexts, a Multi-domain Motion modeling Network (MMNet) is proposed to capture robust motion information from both the spatial and frequency domains. Moreover, a multi-domain context refinement module is developed to discriminatively highlight frequency-domain temporal contexts and adaptively fuse multi-domain temporal contexts, effectively mitigating inaccuracies in temporal contexts caused by motion errors. To efficiently compress video signals, a Multi-scale Long Short-range Decorrelation Module (MLSDM)-based context codec is proposed, in which the MLSDM learns long short-range spatial-temporal dependencies and channel-wise correlations across varying receptive domains. Extensive experimental results show that the proposed method significantly outperforms VTM 17.0 and other state-of-the-art learned video compression methods in terms of both PSNR and MS-SSIM.
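The conditional-coding paradigm the abstract builds on can be illustrated with a toy rate estimate: when the entropy model is allowed to see a motion-aligned temporal context, it only has to account for what that context fails to predict. The sketch below is a minimal illustration with synthetic data and a Gaussian differential-entropy proxy; it is not the paper's model, and the frames, the ideal motion compensation, and the `gaussian_rate` helper are all illustrative assumptions.

```python
import numpy as np

def gaussian_rate(x):
    """Differential-entropy proxy (bits/sample) for a Gaussian fit to x.
    Stands in for a learned entropy model; illustrative only."""
    var = np.var(x) + 1e-12
    return 0.5 * np.log2(2 * np.pi * np.e * var)

rng = np.random.default_rng(0)

# Toy temporally correlated "frames" (synthetic, not the paper's data):
# the current frame is the previous frame shifted by one pixel plus noise.
prev = rng.normal(size=(64, 64))
curr = np.roll(prev, shift=1, axis=1) + 0.1 * rng.normal(size=(64, 64))

# Unconditional coding: model the current frame with no temporal context.
rate_uncond = gaussian_rate(curr)

# Conditional coding (sketch): the entropy model is conditioned on a
# motion-aligned temporal context, so only the unpredictable part costs bits.
# Here we assume ideal motion compensation for simplicity.
context = np.roll(prev, shift=1, axis=1)
rate_cond = gaussian_rate(curr - context)

print(f"unconditional rate: {rate_uncond:.2f} bits/sample")
print(f"conditional rate:   {rate_cond:.2f} bits/sample")
```

With an accurate context the conditional rate is far lower, which is why the paper invests in precise motion modeling: a mis-aligned context inflates the unpredictable part and the bitrate with it.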
Journal Description
The Society’s Field of Interest is “Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects.” In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that “broadcast systems includes all aspects of transmission, propagation, and reception.”