A Low-Latency, Highly-Pipelined Hardware Architecture for H.266/VVC Dependent Quantization
Jun Zhang; Weizhi Bian; Hao Zhang
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 72, no. 8, pp. 4040-4051. Published 2025-06-06. DOI: 10.1109/TCSI.2025.3575567
https://ieeexplore.ieee.org/document/11027472/
Compared with H.265/HEVC, H.266/VVC introduces a new quantization tool, dependent quantization, which significantly reduces the bit rate at the same video quality. However, because the quantization of each transform coefficient depends heavily on the quantization results of the preceding coefficients, computational parallelism is low, which makes the tool ill-suited to hardware pipelining and makes real-time encoding difficult. To increase parallelism, this paper optimizes the rate estimation algorithm used in dependent quantization and designs a parallel quantization structure that evaluates multiple quantization states at once, yielding a pipelined dependent quantization hardware architecture. The main contributions are as follows: 1) A hardware-friendly rate estimation algorithm is proposed for calculating the rate-distortion cost of each quantization level, eliminating the dependency on context templates. 2) A multi-state parallel quantization hardware structure is designed to improve quantization parallelism. Among the multiple quantization paths generated, the path with the lowest cumulative rate-distortion cost is selected as the output. In addition, two trellis memories are used in the quantization level output phase and operated in ping-pong fashion to maximize the output throughput of the quantization module. 3) An 8-stage pipelined architecture is proposed for dependent quantization, and the corresponding hardware module is implemented, quantizing one transform coefficient per cycle. Experimental results show that the proposed dependent quantization hardware module reaches a maximum frequency of 276 MHz, with an average encoding speed of $3840\times 2160$ at 31.4, 83.5, 164.5, and 242.8 fps for QP = 22, 27, 32, and 37, respectively. In the All Intra and Random Access configurations, the Bjontegaard Delta Bitrate (BDBR) increases by only 0.81% and 0.85%, respectively, compared with the standard reference software VTM18.0. Compared with existing hardware quantization schemes, the proposed approach offers higher quantization efficiency and quantization speed.
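To make the dependency described in the abstract concrete, the following is a minimal software sketch of trellis-based dependent quantization: four states, two scalar quantizers, and selection of the path with the lowest cumulative rate-distortion cost. The state transition table and reconstruction rule follow the VVC design as commonly described; the `rate_bits` model, the candidate selection, and all function names are hypothetical simplifications for illustration and are not the rate estimation algorithm or hardware structure of the paper.

```python
import math

# Next state = STATE_TRANS[state][parity of |level|]; states 0-1 use Q0, states 2-3 use Q1.
STATE_TRANS = ((0, 2), (2, 0), (1, 3), (3, 1))


def reconstruct(level, state, delta):
    """Q0 reconstructs even multiples of delta; Q1 reconstructs odd multiples (and zero)."""
    if level == 0:
        return 0.0
    sign = -1 if level < 0 else 1
    return sign * (2 * abs(level) - (state >> 1)) * delta


def rate_bits(level):
    """Hypothetical context-free rate estimate (an assumption, not the paper's model)."""
    return 1.0 + 2.0 * math.log2(abs(level) + 1)


def candidates(coeff, state, delta):
    """Zero plus the two levels of the active quantizer that bracket the coefficient."""
    sign = -1 if coeff < 0 else 1
    q = (abs(coeff) / delta + (state >> 1)) / 2.0
    lo = int(math.floor(q))
    return sorted({0, sign * lo, sign * (lo + 1)})


def dep_quant(coeffs, delta, lam):
    """Viterbi search over the 4-state trellis; returns (levels, total RD cost)."""
    inf = float("inf")
    cost = [0.0, inf, inf, inf]          # coding starts in state 0
    paths = [[], [], [], []]
    for c in coeffs:
        new_cost = [inf] * 4
        new_paths = [None] * 4
        for s in range(4):               # the four states are independent per coefficient
            if cost[s] == inf:
                continue
            for k in candidates(c, s, delta):
                dist = (c - reconstruct(k, s, delta)) ** 2
                j = cost[s] + dist + lam * rate_bits(k)
                ns = STATE_TRANS[s][abs(k) & 1]
                if j < new_cost[ns]:     # keep only the cheapest path into each state
                    new_cost[ns] = j
                    new_paths[ns] = paths[s] + [k]
        cost, paths = new_cost, new_paths
    best = min(range(4), key=lambda s: cost[s])
    return paths[best], cost[best]


def dequantize(levels, delta):
    """Decoder side: replay the state machine to pick the quantizer for each level."""
    state, out = 0, []
    for k in levels:
        out.append(reconstruct(k, state, delta))
        state = STATE_TRANS[state][abs(k) & 1]
    return out


if __name__ == "__main__":
    levels, cost = dep_quant([10.0, -3.0, 0.4, 7.5], delta=1.0, lam=0.1)
    print(levels, cost, dequantize(levels, 1.0))
```

In this sketch the inner loop over the four states has no data dependency between states for a given coefficient, which is the kind of parallelism a multi-state hardware structure can exploit; the dependency that remains is from one coefficient to the next through the accumulated costs.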
About the journal:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems