H.266/VVC相关量化的低延迟、高流水线硬件架构

IF 5.2 1区工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems I: Regular Papers Pub Date : 2025-06-06 DOI:10.1109/TCSI.2025.3575567

Jun Zhang;Weizhi Bian;Hao Zhang

{"title":"H.266/VVC相关量化的低延迟、高流水线硬件架构","authors":"Jun Zhang;Weizhi Bian;Hao Zhang","doi":"10.1109/TCSI.2025.3575567","DOIUrl":null,"url":null,"abstract":"In comparison to H.265/HEVC, H.266/VVC introduces a novel quantization tool—dependent quantization, which significantly reduces the rate while maintaining the same video quality. However, due to the quantization process of the transform coefficients being highly dependent on the quantization results of the preceding coefficients, the computational parallelism is low, making it unsuitable for hardware pipeline processing and difficult to achieve real-time encoding. To enhance parallelism, this paper optimizes the rate estimation algorithm based on dependent quantization and designs a multi-quantization state parallel quantization structure, implementing a pipeline-based dependent quantization hardware architecture. The main contributions of this paper are as follows: 1) A hardware-friendly rate estimation algorithm is proposed for calculating the quantization level rate-distortion cost, eliminating the dependency on context templates. 2) A multi state parallel quantization hardware structure is designed to improve the quantization parallelism. Among the multiple generated quantization paths, the shortest quantization path is output by comparing the cumulative rate-distortion cost. Additionally, two trellis memories are introduced during the quantization level output phase, using a ping-pong operation to maximize the output throughput of the quantization module. 3)An 8-stage pipeline computation architecture is proposed for dependent quantization, and the dependent quantization hardware module is implemented, with a computing performance capable of quantizing one transform coefficient per cycle. Experimental results show that the dependent quantization hardware module designed in this paper achieves a maximum frequency of 276MHz, with encoding average speed reaching <inline-formula> <tex-math>$3840\\times 2160$ </tex-math></inline-formula>@31.4,83.5,164.5,242.8fps under QP = 22,27,32,37 conditions. In both All Intra and Random Access configurations, the Bjontegaard Delta Bitrate (BDBR) only increases by 0.81% and 0.85% compared to the standard reference software VTM18.0, respectively. Compared to existing hardware quantization schemes, our approach offers outstanding quantization efficiency and quantization speed.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 8","pages":"4040-4051"},"PeriodicalIF":5.2000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Low-Latency, Highly-Pipelined Hardware Architecture for H.266/VVC Dependent Quantization\",\"authors\":\"Jun Zhang;Weizhi Bian;Hao Zhang\",\"doi\":\"10.1109/TCSI.2025.3575567\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In comparison to H.265/HEVC, H.266/VVC introduces a novel quantization tool—dependent quantization, which significantly reduces the rate while maintaining the same video quality. However, due to the quantization process of the transform coefficients being highly dependent on the quantization results of the preceding coefficients, the computational parallelism is low, making it unsuitable for hardware pipeline processing and difficult to achieve real-time encoding. To enhance parallelism, this paper optimizes the rate estimation algorithm based on dependent quantization and designs a multi-quantization state parallel quantization structure, implementing a pipeline-based dependent quantization hardware architecture. The main contributions of this paper are as follows: 1) A hardware-friendly rate estimation algorithm is proposed for calculating the quantization level rate-distortion cost, eliminating the dependency on context templates. 2) A multi state parallel quantization hardware structure is designed to improve the quantization parallelism. Among the multiple generated quantization paths, the shortest quantization path is output by comparing the cumulative rate-distortion cost. Additionally, two trellis memories are introduced during the quantization level output phase, using a ping-pong operation to maximize the output throughput of the quantization module. 3)An 8-stage pipeline computation architecture is proposed for dependent quantization, and the dependent quantization hardware module is implemented, with a computing performance capable of quantizing one transform coefficient per cycle. Experimental results show that the dependent quantization hardware module designed in this paper achieves a maximum frequency of 276MHz, with encoding average speed reaching <inline-formula> <tex-math>$3840\\\\times 2160$ </tex-math></inline-formula>@31.4,83.5,164.5,242.8fps under QP = 22,27,32,37 conditions. In both All Intra and Random Access configurations, the Bjontegaard Delta Bitrate (BDBR) only increases by 0.81% and 0.85% compared to the standard reference software VTM18.0, respectively. Compared to existing hardware quantization schemes, our approach offers outstanding quantization efficiency and quantization speed.\",\"PeriodicalId\":13039,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"volume\":\"72 8\",\"pages\":\"4040-4051\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-06-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11027472/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11027472/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

与H.265/HEVC相比，H.266/VVC引入了一种新的依赖于量化工具的量化，在保持相同视频质量的同时显著降低了速率。然而，由于变换系数的量化过程高度依赖于前一系数的量化结果，计算并行性较低，不适合硬件流水线处理，难以实现实时编码。为了提高并行性，本文优化了基于相关量化的速率估计算法，设计了多量化状态并行量化结构，实现了基于流水线的相关量化硬件架构。本文的主要贡献如下：1)提出了一种硬件友好的率估计算法，用于计算量化水平的率失真代价，消除了对上下文模板的依赖。2)设计了多状态并行量化硬件结构，提高了量化并行性。在生成的多个量化路径中，通过比较累积的率失真代价，输出最短的量化路径。此外，在量化级输出阶段引入了两个网格存储器，使用乒乓操作来最大化量化模块的输出吞吐量。3)提出了依赖量化的8级流水线计算架构，实现了依赖量化硬件模块，计算性能达到每周期量化一个变换系数。实验结果表明，本文设计的相关量化硬件模块在QP = 22、27、32、37条件下，最大频率达到276MHz，编码平均速度达到$3840\次2160$ @31.4、83.5164.5242.8 fps。在All Intra和Random Access配置下，与标准参考软件VTM18.0相比，Bjontegaard Delta比特率（BDBR）分别仅提高了0.81%和0.85%。与现有的硬件量化方案相比，我们的方法具有显著的量化效率和量化速度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Low-Latency, Highly-Pipelined Hardware Architecture for H.266/VVC Dependent Quantization

In comparison to H.265/HEVC, H.266/VVC introduces a novel quantization tool—dependent quantization, which significantly reduces the rate while maintaining the same video quality. However, due to the quantization process of the transform coefficients being highly dependent on the quantization results of the preceding coefficients, the computational parallelism is low, making it unsuitable for hardware pipeline processing and difficult to achieve real-time encoding. To enhance parallelism, this paper optimizes the rate estimation algorithm based on dependent quantization and designs a multi-quantization state parallel quantization structure, implementing a pipeline-based dependent quantization hardware architecture. The main contributions of this paper are as follows: 1) A hardware-friendly rate estimation algorithm is proposed for calculating the quantization level rate-distortion cost, eliminating the dependency on context templates. 2) A multi state parallel quantization hardware structure is designed to improve the quantization parallelism. Among the multiple generated quantization paths, the shortest quantization path is output by comparing the cumulative rate-distortion cost. Additionally, two trellis memories are introduced during the quantization level output phase, using a ping-pong operation to maximize the output throughput of the quantization module. 3)An 8-stage pipeline computation architecture is proposed for dependent quantization, and the dependent quantization hardware module is implemented, with a computing performance capable of quantizing one transform coefficient per cycle. Experimental results show that the dependent quantization hardware module designed in this paper achieves a maximum frequency of 276MHz, with encoding average speed reaching

$3840\times 2160$

@31.4,83.5,164.5,242.8fps under QP = 22,27,32,37 conditions. In both All Intra and Random Access configurations, the Bjontegaard Delta Bitrate (BDBR) only increases by 0.81% and 0.85% compared to the standard reference software VTM18.0, respectively. Compared to existing hardware quantization schemes, our approach offers outstanding quantization efficiency and quantization speed.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Circuits and Systems I: Regular Papers 工程技术-工程：电子与电气

CiteScore

9.80

自引率

11.80%

发文量

441

审稿时长

2 months

期刊介绍： TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: - Circuits: Analog, Digital and Mixed Signal Circuits and Systems - Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic - Circuits and Systems, Power Electronics and Systems - Software for Analog-and-Logic Circuits and Systems - Control aspects of Circuits and Systems.