VCNPU: An Algorithm-Hardware Co-Optimized Framework for Accelerating Neural Video Compression
Authors: Siyu Zhang; Wendong Mao; Zhongfeng Wang
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 4, pp. 1014-1027
Publication date: 2024-12-17
DOI: 10.1109/TVLSI.2024.3515113
URL: https://ieeexplore.ieee.org/document/10804689/
Impact factor: 2.8 (JCR Q2, Computer Science, Hardware & Architecture)
Citations: 0
Abstract
Video compression is essential for storing and transmitting video content, and real-time decoding is indispensable for a seamless user experience. Neural video compression (NVC) integrates traditional coding techniques with deep learning, resulting in impressive compression efficiency. However, real-time deployment of advanced NVC models is challenging because of their high complexity and extensive off-chip memory access. This article presents a novel NVC accelerator, called the video compression neural processing unit (VCNPU), developed through an algorithm-hardware co-design framework. First, at the algorithmic level, a reparameterizable video compression network (RepVCN) is proposed to aggregate multiscale features and boost video compression quality. After training, RepVCN can be equivalently transformed into a streamlined structure without extra computations. Second, a mask-sharing pruning strategy is proposed to compress RepVCN in the fast transform domain. It prevents the destruction of sparse patterns caused by model simplification, maintaining the model capacity. Third, at the hardware level, a reconfigurable sparse computing module is designed to flexibly support the sparse fast convolutions and deconvolutions of the compact RepVCN. In addition, a hybrid layer fusion pipeline is proposed to reduce the off-chip data communication caused by extensive motion and residual features. Finally, based on the joint optimization of computation and communication, VCNPU is constructed to support adaptive adjustment across various decoding qualities and is implemented in TSMC 28-nm CMOS technology. Extensive experiments demonstrate that RepVCN provides superior coding quality over other video compression baselines. Meanwhile, VCNPU achieves improvements of $6.7\times$ in throughput, $2.9\times$ in area efficiency, and $4\times$ in energy efficiency compared to prior video processors.
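The abstract states that RepVCN can be equivalently transformed into a streamlined structure after training, but it does not describe the branch layout. As a rough illustration of the general idea of structural reparameterization only, the sketch below assumes a RepVGG-style block (a 3x3 branch, a 1x1 branch, and an identity shortcut, single channel, no normalization) and folds the three branches into one 3x3 kernel. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def conv2d_same(image, kernel):
    """Single-channel 2-D cross-correlation with zero padding ('same' size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the image counts as zero
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, k3, k1):
    """Assumed training-time form: 3x3 branch + 1x1 branch + identity shortcut."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, [[k1]])
    return [[a[i][j] + b[i][j] + image[i][j] for j in range(len(image[0]))]
            for i in range(len(image))]


def merge_branches(k3, k1):
    """Fold the 1x1 branch and the identity into the centre tap of the 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0  # 1x1 weight plus the identity's implicit weight of 1
    return merged


# Toy input and weights (hypothetical values chosen for illustration).
image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
k3 = [[0.5, -1.0, 0.25],
      [0.0, 2.0, -0.5],
      [1.0, 0.75, -0.25]]
k1 = 0.5

train_out = multi_branch(image, k3, k1)                  # three branches
infer_out = conv2d_same(image, merge_branches(k3, k1))   # one merged kernel
max_diff = max(abs(a - b)
               for ra, rb in zip(train_out, infer_out)
               for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

Because convolution is linear, the sum of the branch outputs equals the output of a single convolution whose kernel is the sum of the (zero-padded) branch kernels, so the merged form needs one convolution at inference time instead of three. Real reparameterizable networks typically also fold normalization layers and handle multiple channels; those steps are omitted here.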
About the journal:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of the Transactions focuses on VLSI/ULSI microelectronic systems integration.