Spatio-Temporal Convolutional Neural Network for Enhanced Inter Prediction in Video Coding

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2024-08-26. DOI: 10.1109/TIP.2024.3446228

Philipp Merkle; Martin Winken; Jonathan Pfaff; Heiko Schwarz; Detlev Marpe; Thomas Wiegand



Abstract

This paper presents a convolutional neural network (CNN)-based enhancement to inter prediction in Versatile Video Coding (VVC). Our approach aims at improving the prediction signal of inter blocks with a residual CNN that incorporates spatial and temporal reference samples. It is motivated by the theoretical consideration that neural network-based methods have a higher degree of signal adaptivity than conventional signal processing methods and that spatially neighboring reference samples have the potential to improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. We show that adding a polyphase decomposition stage to the CNN results in a significantly better trade-off between computational complexity and coding performance. Incorporating spatial reference samples in the inter prediction process is challenging: The fact that the input of the CNN for one block may depend on the output of the CNN for preceding blocks prohibits parallel processing. We solve this by introducing a novel signal plane that contains specifically constrained reference samples, enabling parallel decoding while maintaining a high compression efficiency. Overall, experimental results show average bit rate savings of 4.07% and 3.47% for the random access (RA) and low-delay B (LB) configurations of the JVET common test conditions, respectively.
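The polyphase decomposition mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the decomposition factor or where the stage sits in the network, so the factor of 2 and the function names `polyphase_decompose` and `polyphase_reconstruct` below are assumptions made purely for illustration, not the authors' implementation. The general idea is that a block is split into `factor * factor` subsampled phases, so later layers would operate on planes of lower spatial resolution (with more channels), and the rearrangement is lossless.

```python
def polyphase_decompose(block, factor=2):
    """Split a 2-D block (list of rows) into factor*factor subsampled phases.

    Phase index p * factor + q holds the samples block[p::factor][q::factor].
    The factor of 2 is an assumed default, not a value taken from the paper.
    """
    height, width = len(block), len(block[0])
    if height % factor or width % factor:
        raise ValueError("block dimensions must be multiples of factor")
    return [
        [row[q::factor] for row in block[p::factor]]
        for p in range(factor)
        for q in range(factor)
    ]


def polyphase_reconstruct(phases, factor=2):
    """Inverse of polyphase_decompose: interleave the phases back into one block."""
    sub_h, sub_w = len(phases[0]), len(phases[0][0])
    block = [[0] * (sub_w * factor) for _ in range(sub_h * factor)]
    for index, phase in enumerate(phases):
        p, q = divmod(index, factor)  # row offset and column offset of this phase
        for i, row in enumerate(phase):
            for j, sample in enumerate(row):
                block[i * factor + p][j * factor + q] = sample
    return block


# Usage: a 4x4 block becomes four 2x2 phases and is recovered exactly.
block = [[r * 4 + c for c in range(4)] for r in range(4)]
phases = polyphase_decompose(block)
restored = polyphase_reconstruct(phases)
```

With a factor of 2, each phase has a quarter of the samples of the original block, which is one plausible reading of how such a stage could lower the per-layer cost of the convolutions that follow it.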
