BiVM: Accurate Binarized Neural Network for Efficient Video Matting

Impact Factor (IF): 18.6

IEEE Transactions on Pattern Analysis and Machine Intelligence, Pub Date: 2025-07-02, DOI: 10.1109/TPAMI.2025.3584928

Haotong Qin; Xianglong Liu; Xudong Ma; Lei Ke; Yulun Zhang; Jie Luo; Michele Magno

Volume 47(10), pages 9250-9265. Publication type: Journal Article. Open access: no. Source: https://ieeexplore.ieee.org/document/11060852/

Citations: 0

Abstract

Deep neural networks for real-time video matting face significant computational constraints on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization has emerged as one of the most common compression approaches, offering compact 1-bit parameters and efficient bitwise operations. However, binarized video matting networks are limited in both accuracy and efficiency by their degenerated encoder and redundant decoder. A theoretical analysis based on the information bottleneck principle shows that these limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and by redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representations from input videos for accurate prediction. Second, we sparsify the intermediate features of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representations with diverse details while reducing the computational burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with an information-guided strategy, prompting the matting-related representations in full-precision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.
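To make the "compact 1-bit parameters and efficient bitwise operations" mentioned in the abstract concrete, the following is a minimal sketch of generic sign binarization and an XNOR-popcount dot product. It is not BiVM's actual method: the function names (`binarize`, `pack_bits`, `xnor_popcount_dot`), the mean-absolute-value scaling factor, and the toy inputs are all illustrative assumptions.

```python
from typing import List, Tuple


def binarize(values: List[float]) -> Tuple[List[int], float]:
    """Map real values to {-1, +1} plus one scaling factor.

    The scale is the mean absolute value, a common choice in the
    binarization literature (an assumption here, not BiVM's exact rule).
    """
    signs = [1 if v >= 0 else -1 for v in values]
    scale = sum(abs(v) for v in values) / len(values)
    return signs, scale


def pack_bits(signs: List[int]) -> int:
    """Pack a {-1, +1} vector into an integer: bit i is 1 when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors using bitwise operations.

    XNOR marks the positions where the signs agree. With `agree` matches
    out of n, the dot product is agree - (n - agree) = 2 * agree - n.
    """
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agree - n


# Hypothetical toy weights and activations.
weights = [0.4, -1.2, 0.7, -0.1]
activations = [1.0, -0.5, 2.0, 0.3]

w_signs, w_scale = binarize(weights)
a_signs, a_scale = binarize(activations)

# Bitwise result and the plain multiply-accumulate reference on the signs.
bitwise = xnor_popcount_dot(pack_bits(w_signs), pack_bits(a_signs), len(weights))
reference = sum(w * a for w, a in zip(w_signs, a_signs))

# Approximate real-valued dot product recovered with the two scales.
approx = w_scale * a_scale * bitwise
```

In this sketch each weight needs one bit instead of a 32-bit float, and the multiply-accumulate loop collapses into one XOR, one NOT, and one population count per packed word. Both kinds of operation are available on ARM CPUs, the hardware class the abstract reports evaluating on.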

Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Self-citation rate

0.00%

Publication count