VCNPU: An Algorithm-Hardware Co-Optimized Framework for Accelerating Neural Video Compression

Impact Factor: 2.8 | Region 2 (Engineering & Technology) | JCR Q2 (Computer Science, Hardware & Architecture)
Siyu Zhang;Wendong Mao;Zhongfeng Wang
{"title":"VCNPU:一种加速神经视频压缩的算法-硬件协同优化框架","authors":"Siyu Zhang;Wendong Mao;Zhongfeng Wang","doi":"10.1109/TVLSI.2024.3515113","DOIUrl":null,"url":null,"abstract":"Video compression is essential for storing and transmitting video content. Real-time decoding is indispensable for delivering a seamless user experience. Neural video compression (NVC) integrates traditional coding techniques with deep learning, resulting in impressive compression efficiency. However, the real-time deployment of advanced NVC models encounters challenges due to their high complexity and extensive off-chip memory access. This article presents a novel NVC accelerator, called video compression neural processing unit (VCNPU), via an algorithm-hardware co-design framework. First, at the algorithmic level, a reparameterizable video compression network (RepVCN) is proposed to aggregate multiscale features and boost video compression quality. RepVCN can be equivalently transformed into a streamlined structure without extra computations after training. Second, a mask-sharing pruning strategy is proposed to compress RepVCN in the fast transform domain. It effectively prevents the destruction of sparse patterns caused by model simplification, maintaining the model capacity. Third, at the hardware level, a reconfigurable sparse computing module is designed to flexibly support sparse fast convolutions and deconvolutions of the compact RepVCN. Besides, a hybrid layer fusion pipeline is advocated to reduce off-chip data communication caused by extensive motion and residual features. Finally, based on the joint optimization of computation and communication, our VCNPU is constructed to realize adaptive adjustments of various decoding qualities and is implemented under TSMC 28-nm CMOS technology. Extensive experiments demonstrate that our RepVCN provides superior coding quality over other video compression baselines. Meanwhile, our VCNPU achieves <inline-formula> <tex-math>$6.7\\times $ </tex-math></inline-formula> improvements in throughput, <inline-formula> <tex-math>$2.9\\times $ </tex-math></inline-formula> in area efficiency, and <inline-formula> <tex-math>$4\\times $ </tex-math></inline-formula> in energy efficiency compared to prior video processors.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1014-1027"},"PeriodicalIF":2.8000,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VCNPU: An Algorithm-Hardware Co-Optimized Framework for Accelerating Neural Video Compression\",\"authors\":\"Siyu Zhang;Wendong Mao;Zhongfeng Wang\",\"doi\":\"10.1109/TVLSI.2024.3515113\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video compression is essential for storing and transmitting video content. Real-time decoding is indispensable for delivering a seamless user experience. Neural video compression (NVC) integrates traditional coding techniques with deep learning, resulting in impressive compression efficiency. However, the real-time deployment of advanced NVC models encounters challenges due to their high complexity and extensive off-chip memory access. This article presents a novel NVC accelerator, called video compression neural processing unit (VCNPU), via an algorithm-hardware co-design framework. 
First, at the algorithmic level, a reparameterizable video compression network (RepVCN) is proposed to aggregate multiscale features and boost video compression quality. RepVCN can be equivalently transformed into a streamlined structure without extra computations after training. Second, a mask-sharing pruning strategy is proposed to compress RepVCN in the fast transform domain. It effectively prevents the destruction of sparse patterns caused by model simplification, maintaining the model capacity. Third, at the hardware level, a reconfigurable sparse computing module is designed to flexibly support sparse fast convolutions and deconvolutions of the compact RepVCN. Besides, a hybrid layer fusion pipeline is advocated to reduce off-chip data communication caused by extensive motion and residual features. Finally, based on the joint optimization of computation and communication, our VCNPU is constructed to realize adaptive adjustments of various decoding qualities and is implemented under TSMC 28-nm CMOS technology. Extensive experiments demonstrate that our RepVCN provides superior coding quality over other video compression baselines. Meanwhile, our VCNPU achieves <inline-formula> <tex-math>$6.7\\\\times $ </tex-math></inline-formula> improvements in throughput, <inline-formula> <tex-math>$2.9\\\\times $ </tex-math></inline-formula> in area efficiency, and <inline-formula> <tex-math>$4\\\\times $ </tex-math></inline-formula> in energy efficiency compared to prior video processors.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"33 4\",\"pages\":\"1014-1027\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-12-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10804689/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10804689/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Video compression is essential for storing and transmitting video content. Real-time decoding is indispensable for delivering a seamless user experience. Neural video compression (NVC) integrates traditional coding techniques with deep learning, resulting in impressive compression efficiency. However, the real-time deployment of advanced NVC models encounters challenges due to their high complexity and extensive off-chip memory access. This article presents a novel NVC accelerator, called video compression neural processing unit (VCNPU), via an algorithm-hardware co-design framework. First, at the algorithmic level, a reparameterizable video compression network (RepVCN) is proposed to aggregate multiscale features and boost video compression quality. RepVCN can be equivalently transformed into a streamlined structure without extra computations after training. Second, a mask-sharing pruning strategy is proposed to compress RepVCN in the fast transform domain. It effectively prevents the destruction of sparse patterns caused by model simplification, maintaining the model capacity. Third, at the hardware level, a reconfigurable sparse computing module is designed to flexibly support sparse fast convolutions and deconvolutions of the compact RepVCN. Besides, a hybrid layer fusion pipeline is advocated to reduce off-chip data communication caused by extensive motion and residual features. Finally, based on the joint optimization of computation and communication, our VCNPU is constructed to realize adaptive adjustments of various decoding qualities and is implemented under TSMC 28-nm CMOS technology. Extensive experiments demonstrate that our RepVCN provides superior coding quality over other video compression baselines. Meanwhile, our VCNPU achieves $6.7\times $ improvements in throughput, $2.9\times $ in area efficiency, and $4\times $ in energy efficiency compared to prior video processors.
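
The statement that RepVCN "can be equivalently transformed into a streamlined structure without extra computations after training" refers to structural reparameterization: multi-branch training-time blocks are algebraically folded into a single convolution before deployment. The sketch below is a minimal PyTorch illustration of that general technique, assuming a RepVGG-style two-branch block (a 3x3 convolution plus a parallel 1x1 convolution); the `RepBlock` class, the `merge()` helper, and the branch layout are illustrative assumptions, not the actual RepVCN architecture, whose details are not given in the abstract.

```python
# A minimal sketch of structural reparameterization, assuming a RepVGG-style
# two-branch block (3x3 conv + parallel 1x1 conv). The exact branch layout of
# RepVCN is not described in the abstract, so RepBlock and merge() are
# illustrative names only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RepBlock(nn.Module):
    """Training-time block: two parallel convolution branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1x1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv3x3(x) + self.conv1x1(x)

    @torch.no_grad()
    def merge(self) -> nn.Conv2d:
        """Fold both branches into one 3x3 conv with identical outputs."""
        fused = nn.Conv2d(self.conv3x3.in_channels,
                          self.conv3x3.out_channels, 3, padding=1)
        # Zero-pad the 1x1 kernel to 3x3 (value at the center tap), then sum
        # kernels and biases; convolution is linear in its weights.
        w1x1 = F.pad(self.conv1x1.weight, [1, 1, 1, 1])
        fused.weight.copy_(self.conv3x3.weight + w1x1)
        fused.bias.copy_(self.conv3x3.bias + self.conv1x1.bias)
        return fused


# Sanity check: the merged single-branch conv reproduces the two-branch output.
block = RepBlock(8).eval()
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(block(x), block.merge()(x), atol=1e-5)
```

Folding the branches after training leaves only plain convolutions and deconvolutions in the deployed network, which is consistent with the abstract's point that the reconfigurable sparse computing module targets the compact, post-transformation RepVCN.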
Source journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CiteScore: 6.40
Self-citation rate: 7.10%
Articles published: 187
Review time: 3.6 months

Journal description: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.