{"title":"An Online-Training-Free Adaptor for Open Heterogeneous Collaborative Perception via Diffusion Model","authors":"Tianhang Wang;Fan Lu;Sanqing Qu;Bin Li;Ya Wu;Hu Cao;Alois Knoll;Guang Chen","doi":"10.1109/TCSVT.2025.3628726","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3628726","url":null,"abstract":"Collaborative perception seeks to mitigate the limitations of single-vehicle perception, such as occlusions, by facilitating communication and information sharing among connected vehicles. However, most existing works assume a homogeneous scenario where all vehicles share identical sensor types and perception model architectures. In contrast, real-world systems often involve heterogeneous agents with diverse sensor configurations and independently developed models. In such settings, directly exchanging features without proper alignment can significantly degrade performance and hinder effective collaboration. While some methods have been proposed to address heterogeneity, they typically require retraining or access to internal model parameters, making them impractical for scalable deployment. To address these challenges, we propose DiffAlign, a plug-and-play adapter that enables feature alignment across heterogeneous agents in a training-free and model-agnostic manner. DiffAlign treats received BEV features as noisy latent representations and progressively refines them through a pretrained diffusion process. This alignment strategy does not require access to model internals or any retraining, which makes it both scalable and privacy-preserving while supporting diverse sensor modalities and perception backbones. Extensive experiments on simulated OPV2V and real-world V2V4Real datasets demonstrate that DiffAlign consistently improves detection performance in heterogeneous settings, improving CoBEVT by 132.01% and 91.95%, respectively. 
Our method provides a practical path toward scalable, generalizable, and deployment-ready collaborative perception.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5729-5741"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReFHD-Net: A Reversible Functionality Hiding Framework for Deep Neural Networks","authors":"Na Wang;Pengpeng Li;Lin Huang;Fang Cao;Xinyi Wang;Chuan Qin","doi":"10.1109/TCSVT.2025.3632812","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3632812","url":null,"abstract":"With the rapid development of artificial intelligence, deep neural networks (DNNs) have become valuable digital assets, thereby highlighting the urgent need for copyright protection and secure transmission. Although traditional model watermarking and active defense techniques offer partial protection against unauthorized use, they often suffer from limited imperceptibility and may degrade model performance. To overcome these challenges, this paper proposes ReFHD-Net, a reversible functionality hiding framework for DNNs based on a structured mask matrix. Here, reversible functionality hiding refers to the ability to hide the functionality of the secret task within the stego model during transmission and enable its lossless recovery by authorized users at the receiver side. Specifically, ReFHD-Net employs a two-stage strategy to hide the secret functionality within a carrier model. In the first stage, a multi-task learning framework enhanced with homoscedastic uncertainty is employed to jointly train the model on both public and secret tasks. In the second stage, the model parameters are further optimized using a combination of task-driven loss and parameter distribution regularization, which limits parameter deviations caused by the hiding process and enhances the imperceptibility of the secret task. Experimental results on image classification and denoising benchmarks validate the superiority of our ReFHD-Net. It achieves an average degradation of only 0.27% in the public task and enables lossless recovery of the secret task with no performance drop. 
Moreover, our framework exhibits strong robustness and security against various unauthorized recovery attempts including random guessing, fine-tuning, and model pruning.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5683-5695"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
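The first training stage in the ReFHD-Net abstract balances the public and secret tasks via homoscedastic-uncertainty weighting. A minimal sketch of that standard weighting scheme (the function name and the example loss values are illustrative, not taken from the paper):

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses with homoscedastic-uncertainty weights:
    L = sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2) is a
    trainable scalar per task (Kendall-style multi-task weighting)."""
    assert len(losses) == len(log_vars)
    total = 0.0
    for loss, s in zip(losses, log_vars):
        total += math.exp(-s) * loss + s
    return total

# A task assigned higher uncertainty (larger s) contributes less loss,
# while the +s term penalizes inflating uncertainty without limit.
public_loss, secret_loss = 0.8, 1.5   # hypothetical per-task losses
combined = uncertainty_weighted_loss([public_loss, secret_loss], [0.0, 1.0])
```

In a real framework the `log_vars` would be learnable parameters updated jointly with the network weights, so the trade-off between the public and secret tasks is tuned automatically rather than by a hand-picked coefficient.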
{"title":"Semantic-Interactive Clustering Optimization With SAM for Weakly Supervised Person Search","authors":"Xi Yang;Hexun Zhou;De Cheng;Menghui Tian;Nannan Wang","doi":"10.1109/TCSVT.2025.3636572","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3636572","url":null,"abstract":"Weakly-supervised person search presents significant challenges when relying solely on bounding-box annotations, particularly due to inter-class confusion from clothing similarity and intra-class variations caused by illumination changes, which severely degrade cross-view matching accuracy. Existing clustering-based methods, constrained by their heavy dependence on color features, frequently produce unreliable pseudo-labels that ultimately limit model performance. To overcome these limitations, we present Segment Anything Model-based Semantic-Interactive Clustering Optimization (SAM-SICO), a novel framework that integrates the Segment Anything Model’s semantic segmentation capability with adaptive clustering optimization for weakly-supervised person search. Our framework harnesses the representational power of the Segment Anything Model (SAM) to enable detector-free semantic feature learning while significantly improving clustering precision. The proposed solution makes three key advances: the Semantic Contour Embedding (SCE) module leverages SAM’s zero-shot segmentation capability to produce highly accurate human body masks; the Relation-driven Semantic Feature Interaction (RSFI) mechanism effectively mitigates clothing-color bias through innovative dynamic affinity matrix construction across multiscale semantic masks and visual features; and the Adaptive Clustering Optimization (ACO) algorithm introduces parameter adaptation to optimize intra-class compactness and inter-class separation metrics. Experimental results show that our method outperforms existing state-of-the-art approaches on the PRW and CUHK-SYSU datasets. 
The source code is available at <uri>https://github.com/HawlsonZ/SAM-SICO</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5642-5654"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pseudocylindrical Convolutions for Learned Omnidirectional Image Compression","authors":"Mu Li;Kede Ma;Jinxing Li;David Zhang","doi":"10.1109/TCSVT.2025.3638018","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638018","url":null,"abstract":"Equirectangular projection (ERP) is a convenient form to store omnidirectional images, but it is neither equal-area nor conformal, creating challenges for subsequent visual communication. When used for image compression, ERP amplifies sampling density and deforms objects near the poles, hindering perceptually optimal bit allocation. Here, we present one of the earliest endeavors to apply deep neural networks to omnidirectional image compression. We first propose parametric pseudocylindrical representations that generalize common pseudocylindrical map projections. A tractable greedy algorithm is introduced to identify (sub-)optimal representation configurations, guided by a proxy objective for rate-distortion performance. We then develop pseudocylindrical convolutions, which can be efficiently implemented by standard convolutions with “pseudocylindrical padding.” To demonstrate the utility of the proposed pseudocylindrical representations and convolutions, we implement an end-to-end omnidirectional image compression method, consisting of an analysis transform, a uniform quantizer, a synthesis transform, and an entropy model. 
Experiments show that our optimized method achieves consistently better rate-distortion performance compared to the state-of-the-art.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5497-5509"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
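The abstract above notes that pseudocylindrical convolutions can be implemented as standard convolutions with "pseudocylindrical padding." A toy sketch of the horizontal (longitude) part of such padding, assuming each latitude row wraps around the full 360° of longitude; the function name and sample rows are illustrative, not from the paper:

```python
def pseudocylindrical_pad(rows, pad):
    """Horizontally pad each latitude row with circular (wrap-around)
    samples, since every row spans the full circle of longitude.
    `rows` is a list of per-latitude sample lists, which may differ in
    length (equal-area rows shrink toward the poles). Vertical padding
    is omitted in this sketch."""
    padded = []
    for row in rows:
        left = row[-pad:]    # wrap samples from the right edge
        right = row[:pad]    # wrap samples from the left edge
        padded.append(left + row + right)
    return padded

# Two hypothetical latitude rows of different widths.
rows = [[1, 2], [3, 4, 5, 6]]
out = pseudocylindrical_pad(rows, 1)
```

In a deep-learning framework the same effect per row is obtained with circular padding along the width dimension before an ordinary convolution, which is what makes the scheme efficient to implement.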
{"title":"Mining Temporal Redundancy Using Long Short-Term Motion Aggregation and Global–Local Decorrelation for Learned Video Compression","authors":"Feng Yuan;Zhaoqing Pan;Jianjun Lei;Bo Peng;Haoran Xie;Fu Lee Wang;Sam Kwong","doi":"10.1109/TCSVT.2025.3638161","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638161","url":null,"abstract":"The conditional coding paradigm is widely used in learned video compression, which shows superior performance in capturing redundancies within a large context space. However, existing Conditional coding-based Learned Video Compression (C-LVC) methods ignore that the predicted motion vectors usually contain large uncertainty due to complex motions, occlusions, etc., which consequently decrease the accuracy of the generated temporal contexts. In addition, existing C-LVC methods have a weak ability to mine diverse dependencies within the context space, which are closely related to the coding efficiency. To address these issues, an efficient temporal redundancy mining method is proposed to improve the coding efficiency of C-LVC in this paper. To generate accurate temporal contexts, a Long Short-Term Motion Aggregation (LSTMA) model is proposed, in which an LSTMA-based motion estimation module is developed to capture both current and aggregated long short-term motion information to reduce the uncertainty of predicted motion vectors. Based on the dual motion information, an LSTMA-based temporal context mining module is developed to exploit the aggregated long short-term motion information and increase the accuracy of the generated temporal contexts. 
In order to fully eliminate spatial-temporal redundancies in a video, a Global-Local Information Decorrelation Module (GLIDM)-based context codec is proposed, in which the GLIDM is designed based on the visual state space block (namely, VMamba), the residual block, and the squeeze-and-excitation block to effectively capture long-range and short-range spatial-temporal dependencies as well as channel-wise dependencies. Experimental results demonstrate that our proposed method can effectively improve the coding performance of C-LVC, and outperforms other state-of-the-art LVC methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5510-5524"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Syntax Element Encryption for H.265/HEVC Using Chaotic Map-Based Coefficient Scrambling Scheme","authors":"Liang-Wei Li;Chung-Nan Lee;Kishu Gupta;Huei-Fang Yang;Ashutosh Kumar Singh","doi":"10.1109/TCSVT.2025.3625077","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3625077","url":null,"abstract":"In today’s digital landscape, high-efficiency video coding (H.265/HEVC) has emerged as the most widely used video coding standard, employing selective encryption schemes to protect the privacy of video content while maintaining efficient compression performance. However, existing coefficient scrambling methods impose a significant computational load, leading to increased bit rate overhead due to encryption, longer execution times, and insufficient safety measures. To address these issues, a new coefficient scrambling scheme based on <italic>chaotic maps</i> is proposed. This approach leverages the pseudorandomness, ergodicity, and sensitivity to initial conditions inherent in chaotic maps to generate highly unpredictable coefficient distributions, thereby strengthening security while preserving low complexity. Unlike conventional scrambling, chaotic maps ensure minimal correlation between encrypted coefficients, enhancing resistance against statistical and differential attacks. Additionally, the scrambling conditions are specifically designed to minimize the impact on the bit rate overhead. Furthermore, when combined with syntax element encryption (SEC), which includes motion vector difference (MVD), quantized transform coefficients (QTC), and luma intraprediction mode (Luma IPM), this method effectively distorts video content. The proposed scheme operates synchronously with slices, ensuring that the decryption of video content remains intact even if some slices are lost. Additionally, a random sequence generated by AES-CTR is incorporated with the H.265 encoded stream to protect against chosen-plaintext attacks. 
The experimental results indicate that this scheme features high security, compliance with format standards, fast execution times, synchronous updates with slices, and resilience against common attacks, all while achieving a reduced bit rate overhead of 45.13% with a lowered average execution time overhead of 1.91%.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5655-5670"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
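The core primitive in the scheme above is a chaotic map whose pseudorandom, key-sensitive orbit drives coefficient scrambling. A toy, codec-independent sketch using the logistic map to permute a coefficient block (the map parameter, key, and function names are illustrative, not the paper's HEVC-integrated construction):

```python
def logistic_sequence(x0, n, r=3.99):
    """Iterate the logistic map x_{k+1} = r * x_k * (1 - x_k); for r
    near 4 the orbit is chaotic and extremely sensitive to the key x0."""
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs

def scramble(coeffs, key):
    """Permute a coefficient block using the rank order of the chaotic
    sequence as the permutation; also return the permutation so an
    authorized receiver can invert it."""
    seq = logistic_sequence(key, len(coeffs))
    order = sorted(range(len(coeffs)), key=seq.__getitem__)
    return [coeffs[i] for i in order], order

def unscramble(scrambled, order):
    """Invert the permutation produced by scramble()."""
    out = [0] * len(scrambled)
    for pos, i in enumerate(order):
        out[i] = scrambled[pos]
    return out

block = [10, -3, 0, 7, 2]          # stand-in for quantized coefficients
enc, perm = scramble(block, 0.3141)
assert unscramble(enc, perm) == block
```

In the actual scheme the key material would be derived from the AES-CTR sequence mentioned in the abstract and the scrambling applied selectively inside the HEVC entropy-coding path, so that the bitstream stays format-compliant.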
{"title":"Steganography via Neural Network Parameter Initialization With High Fidelity and Imperceptibility","authors":"Na Wang;Chenyi Xu;Fang Cao;Lin Huang;Wei Wang;Chuan Qin","doi":"10.1109/TCSVT.2025.3634666","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3634666","url":null,"abstract":"In recent years, with the rapid advancement of deep neural networks (DNNs), researchers have explored steganography techniques that use DNN models as carriers for secret information hiding. However, existing methods generally suffer from limited imperceptibility and fidelity. To address these limitations, this paper proposes a steganography method that achieves high imperceptibility and fidelity while providing substantial embedding capacity and robustness. Specifically, we introduce a dual-branch encoder that embeds secret information into the initialization parameters of the cover model with almost no degradation of the model’s functionality. In addition, a new SMSE loss is employed to constrain the encoder output, which enhances the imperceptibility of the stego model. After training and transmission, the receiver can utilize a decoder to accurately extract secret information from the stego model. Experimental results demonstrate that the proposed method achieves a Kullback–Leibler (KL) divergence more than an order of magnitude lower than existing methods, with values ranging from 0.0003 to 0.007. The stego model preserves high fidelity to the original model, with classification accuracy differences within 0.005 on benchmark datasets including MNIST, CIFAR-10, and SST-2. In terms of embedding capacity, it achieves 319,312 bits on ResNet-18 and 1,757,952 bits on ViT, which exceeds the performance of baseline methods across most models. Furthermore, the proposed method exhibits strong robustness, as the embedded information can still be accurately recovered with BCH coding even under noise attacks at an SNR as low as -6 dB. 
It also demonstrates strong generalization, as it performs effectively on both classification networks and generative or reconstruction models such as GANs, VAEs, and U-Nets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5696-5713"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing","authors":"Jimin Dai;Yingzhen Zhang;Shuo Chen;Jian Yang;Lei Luo","doi":"10.1109/TCSVT.2025.3638406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638406","url":null,"abstract":"Diffusion models (DMs) have been successfully applied to real image editing. These models typically invert images into latent noise vectors during the inversion process, and then edit them during the inference process. However, DMs often rely on the local linearization assumption, which assumes that the noise injected during the inversion process approximates the noise removed during the inference process. While DMs efficiently generate images under this assumption, it also accumulates errors during the diffusion process due to the assumption, ultimately negatively impacting the quality of real image reconstruction and editing. To address this issue, we propose a novel ERDDCI (Exact Reversible Diffusion via Dual-Chain Inversion). ERDDCI uses the new Dual-Chain Inversion (DCI) for joint inference to derive an exact reversible diffusion process. Using DCI, our method avoids the cumbersome optimization process in existing inversion approaches and achieves high-quality image editing. Additionally, to accommodate image operations under high guidance scales, we introduce a dynamic control strategy that enables more refined image reconstruction and editing. Our experiments demonstrate that ERDDCI significantly outperforms state-of-the-art methods in a 50-step diffusion process. It achieves rapid and precise image reconstruction with SSIM of 0.999 and LPIPS of 0.001, and delivers competitive results in image editing. 
The source code is available at: <uri>https://github.com/daii-y/ERDDCI</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5437-5452"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Modal Generative AI: Multi-Modal LLMs, Diffusions, and the Unification","authors":"Xin Wang;Yuwei Zhou;Bin Huang;Hong Chen;Wenwu Zhu","doi":"10.1109/TCSVT.2025.3635224","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3635224","url":null,"abstract":"Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for <italic>multi-modal understanding</i>; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of <italic>multi-modal generation</i>. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models, respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. 
Last but not least, we present several challenging future research directions that may contribute to the ongoing advancement of multi-modal generative AI.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5621-5641"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeskPred: Two-Stage Video Stream Bandwidth Prediction for Cold-Start and Training Forgetting in Cloud Desktops","authors":"Zuodong Jin;Dan Tao;Peng Qi;Jiayu Zhang;Gang Han;Ruipeng Gao","doi":"10.1109/TCSVT.2025.3632222","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3632222","url":null,"abstract":"As a cloud-hosted virtual desktop service, a cloud desktop supports use cases such as telecommuting and collaborative development, while enabling real-time user interaction through video streams. The stability of this process is determined by bandwidth, which significantly influences the user experience. Therefore, precise bandwidth prediction of video streams is essential in cloud desktops. This work proposes DeskPred for video stream transmission in cloud desktops, focusing on dynamic bandwidth prediction. In the startup stage, the limited data amount poses a challenge for achieving precise bandwidth predictions. We propose an Affinity-based Federated Learning algorithm, which leverages the historical records of high-affinity users for assisted training, all while protecting user privacy. During the long-term adjustment stage, we propose a Fluctuation-based Adaptive Incremental Prediction algorithm for independent training to address the issue of pattern forgetting. The algorithm considers both periodic features and instantaneous features, incorporating new patterns while revisiting previous knowledge through the memory module and Adversarial Elastic Weight Consolidation. We have verified DeskPred through an actual cloud desktop project supported by Lenovo Research. 
Through experiments conducted on a total of over 18 million data items (approximately 10 GB), DeskPred achieves the highest total score of 71.11%, making it highly suitable for cloud desktop environments.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5595-5607"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
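The abstract does not detail how the Affinity-based Federated Learning step aggregates the high-affinity users' histories; a generic affinity-weighted aggregation in the FedAvg style, with all names and numbers hypothetical, could look like:

```python
def affinity_weighted_average(client_params, affinities):
    """Aggregate per-client parameter vectors with weights proportional
    to each client's affinity to the cold-start user, so the histories
    of similar users dominate the bootstrapped model. Only parameters
    (not raw usage records) leave each client, preserving privacy."""
    total = sum(affinities)
    agg = [0.0] * len(client_params[0])
    for params, a in zip(client_params, affinities):
        w = a / total
        for j, p in enumerate(params):
            agg[j] += w * p
    return agg

# Two hypothetical high-affinity clients and their (tiny) model vectors.
clients = [[1.0, 0.0], [3.0, 2.0]]
affinity = [1.0, 3.0]
bootstrap = affinity_weighted_average(clients, affinity)
```

The cold-start user would start from `bootstrap` and then switch to independent incremental training once enough local data accumulates, matching the two-stage split described above.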