{"title":"AdaptGCD: Multi-Expert Adapter Tuning for Generalized Category Discovery","authors":"Yuxun Qu;Yongqiang Tang;Chenyang Zhang;Wensheng Zhang","doi":"10.1109/TCSVT.2025.3602981","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602981","url":null,"abstract":"Different from the traditional semi-supervised learning paradigm that is constrained by the closed-world assumption, Generalized Category Discovery (GCD) presumes that the unlabeled dataset contains new categories not appearing in the labeled set, and aims to not only classify old categories but also discover new categories in the unlabeled data. Existing studies on GCD typically devote themselves to transferring general knowledge from a self-supervised pretrained model to the target GCD task via fine-tuning strategies such as partial tuning and prompt learning. Nevertheless, these fine-tuning methods fail to strike a sound balance between the generalization capacity of the pretrained backbone and the adaptability to the GCD task. To fill this gap, in this paper, we propose a novel adapter-tuning-based method named AdaptGCD, which is the first work to introduce adapter tuning into the GCD task and provides some key insights expected to enlighten future research. Furthermore, considering the discrepancy of supervision information between the old and new classes, a multi-expert adapter structure equipped with a route assignment constraint is elaborately devised, such that the data from old and new classes are separated into different expert groups. Extensive experiments are conducted on 7 widely used datasets. 
The remarkable performance improvements highlight the efficacy of our proposal, which can also be combined with other advanced methods, such as SPTNet, for further enhancement.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2344-2357"},"PeriodicalIF":11.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
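The multi-expert adapter idea above can be sketched with plain numpy. The hard top-1 router, shapes, and names below are illustrative assumptions rather than the paper's exact design, which additionally imposes a route assignment constraint so that old- and new-class data fall into different expert groups:

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(x, W_down, W_up):
    # bottleneck adapter: down-project, ReLU, up-project, residual connection
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up

def moe_adapter(x, experts, router_W):
    # hard top-1 routing: each token is sent to its highest-scoring expert
    logits = x @ router_W                 # (n_tokens, n_experts)
    choice = logits.argmax(axis=1)
    out = np.empty_like(x)
    for e, (W_down, W_up) in enumerate(experts):
        mask = choice == e
        out[mask] = adapter(x[mask], W_down, W_up)
    return out, choice

d, bottleneck, n_experts = 8, 2, 4
experts = [(0.1 * rng.normal(size=(d, bottleneck)),
            0.1 * rng.normal(size=(bottleneck, d))) for _ in range(n_experts)]
router_W = rng.normal(size=(d, n_experts))
tokens = rng.normal(size=(5, d))
out, choice = moe_adapter(tokens, experts, router_W)
```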
{"title":"Medical VLP Model Is Vulnerable: Toward Multimodal Adversarial Attack on Large Medical Vision-Language Models","authors":"Zimu Lu;Ning Xu;Hongshuo Tian;Lanjun Wang;An-An Liu","doi":"10.1109/TCSVT.2025.3602970","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602970","url":null,"abstract":"Medical Visual Question Answering (Medical VQA) is an essential task that facilitates the automated interpretation of complex clinical imagery with corresponding textual questions, thereby supporting both clinicians and patients in making informed medical decisions. With the rapid progress of Vision-Language Pretraining (VLP) in general domains, the development of medical VLP models has emerged as a rapidly growing interdisciplinary area at the intersection of artificial intelligence (AI) and healthcare. However, few works have evaluated the adversarial robustness of medical VLP models, which faces two primary challenges: 1) the complexity of medical texts, stemming from the presence of terminologies, poses significant challenges for models in comprehending the text for adversarial attacks; 2) the diversity of medical images, arising from the variety of anatomical regions depicted, requires models to determine critical anatomical regions to attack. In this paper, we propose a novel multimodal adversarial attack generator for evaluating the robustness of medical VLP models. Specifically, for the complexity of medical texts, we integrate medical knowledge when crafting text adversarial samples, which facilitates the understanding of terminologies and improves adversarial strength; for the diversity of medical images, we divide medical images into global and local anatomical regions, whose perturbations are balanced by learned weights. 
Our experimental study not only provides a quantitative understanding of the robustness of medical VLP models, but also underscores the critical need for thorough safety evaluations before implementing them in real-world medical applications.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2478-2491"},"PeriodicalIF":11.1,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
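The image-side strategy — balancing global against local (anatomical-region) perturbations with learned weights — reduces to a weighted blend of two perturbations. A minimal sketch, assuming a fixed region mask and a sign-gradient step (in the paper both the regions and the balance weights are learned):

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb(image, grad, region_mask, alpha, eps=0.03):
    # blend a global sign-gradient perturbation with one restricted to a
    # local anatomical region; alpha in [0, 1] is the balance weight
    g = np.sign(grad)
    delta = alpha * g + (1.0 - alpha) * g * region_mask
    return np.clip(image + eps * delta, 0.0, 1.0)

img = rng.random((4, 4))            # toy "medical image" in [0, 1]
grad = rng.normal(size=(4, 4))      # stand-in for a loss gradient
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                # hypothetical local region
adv = perturb(img, grad, mask, alpha=0.3)
```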
{"title":"Text and Non-Text Latent Feature Disentanglement for Screen Content Image Compression","authors":"Hao Wang;Junyan Huo;Fei Yang;Shuai Wan;Gaoxing Chen;Kun Yang;Luis Herranz;Fuzheng Yang","doi":"10.1109/TCSVT.2025.3602506","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602506","url":null,"abstract":"With the growing prevalence of screen content images in multimedia communication, efficient compression has become increasingly crucial. Unlike natural scene images, screen content typically contains rich text regions that exhibit unique characteristics and low correlation with surrounding non-text elements. The intricate mixture of text and non-text within images poses significant challenges for existing learned compression networks, as the text and non-text features are severely entangled in the latent domain along the channel dimension, leading to compromised reconstruction quality and suboptimal entropy estimation. In this paper, we propose a novel <bold>Disentangled Image Compression Architecture (DICA)</bold> that enhances the analysis module and the entropy model of existing compression architectures to address these limitations. First, we introduce a <bold>Disentangled Analysis Module (DAM)</bold> by augmenting original analysis modules with an additional text approximation branch and a disentangling network. They work in concert to disentangle latent features into text and non-text classes along the channel dimension, resulting in a more structured feature distribution that better aligns with compression requirements. Second, we propose a Disentangled Channel-Conditional Entropy Model (DCEM) that efficiently leverages the feature distribution bias introduced by DAM, thereby further improving compression performance. 
Experimental results demonstrate that the proposed DICA, along with DAM and DCEM, can be integrated into various channel-conditional compression backbones, significantly improving their performance in screen content compression, particularly in hard-to-compress text regions. When integrated with an advanced WACNN backbone, our method achieves a 13% overall BD-Rate gain and a 16% BD-Rate gain in text regions on the SIQAD dataset.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2505-2519"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
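The BD-Rate numbers quoted above come from the standard Bjøntegaard metric, which compares two rate-quality curves. A common implementation fits a cubic in log-rate over the overlapping quality range; details such as the fit order vary across implementations:

```python
import numpy as np

def bd_rate(rates_a, qual_a, rates_b, qual_b):
    # Bjøntegaard delta rate: average bitrate difference (%) of curve B
    # relative to curve A over their overlapping quality range
    p_a = np.polyfit(qual_a, np.log(rates_a), 3)
    p_b = np.polyfit(qual_b, np.log(rates_b), 3)
    lo = max(min(qual_a), min(qual_b))
    hi = min(max(qual_a), max(qual_b))
    int_a, int_b = np.polyint(p_a), np.polyint(p_b)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_b = (np.polyval(int_b, hi) - np.polyval(int_b, lo)) / (hi - lo)
    return (np.exp(avg_b - avg_a) - 1.0) * 100.0

rates = np.array([100.0, 200.0, 400.0, 800.0])    # kbps
psnr = np.array([30.0, 33.0, 36.0, 39.0])         # dB
same = bd_rate(rates, psnr, rates, psnr)          # identical curves: ~0%
halved = bd_rate(rates, psnr, rates * 0.5, psnr)  # half the rate: ~-50%
```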
{"title":"Phylogeny-Based Traitor Tracing Method for Interleaving Attacks","authors":"Karama Abdelhedi;Faten Chaabane;Walid Wannes;William Puech;Chokri Ben Amar","doi":"10.1109/TCSVT.2025.3602214","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602214","url":null,"abstract":"Today, the popularity of 3D videos is increasing significantly. This trend can be attributed to their immersive appeal and lifelike experience. In an era dominated by the widespread distribution of digital content, data integrity and ownership are of crucial importance. In this context, the practice of traitor tracing, closely related to Digital Rights Management (DRM), facilitates the identification and tracking of unauthorized users who violate copyright by illegally sharing protected content. In this paper, we introduce an innovative traitor tracing approach for 3D video, with a particular focus on the DIBR (Depth Image-Based Rendering) format, which is vulnerable to interleaving attack strategies. For this purpose, we develop a new phylogeny tree construction method designed to combat collusion attacks. Our experimental evaluations demonstrate the effectiveness of the proposed approach, particularly when applied to long fingerprinting codes. 
Compared to Tardos’ approach, our method delivers very good results, even for a large number of colluders.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2623-2634"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
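For context on the Tardos baseline mentioned above: probabilistic fingerprinting codes assign each user a binary codeword drawn with per-position biases, then accuse users whose codewords correlate with the pirated copy. A simplified numpy sketch (uniform biases in place of the arcsine distribution, and a majority-vote collusion) illustrates the scoring:

```python
import numpy as np

rng = np.random.default_rng(7)
n_users, code_len = 20, 3000
colluders = [0, 1, 2]

p = rng.uniform(0.1, 0.9, size=code_len)                  # per-position biases
codes = (rng.random((n_users, code_len)) < p).astype(int)
pirate = (codes[colluders].sum(axis=0) >= 2).astype(int)  # majority-vote copy

def accusation_scores(codes, pirate, p):
    # symmetric Tardos-style score: matching the pirate's symbol earns
    # sqrt((1-q)/q) and disagreeing costs sqrt(q/(1-q)), where q is the
    # bias toward the pirate's symbol at that position
    q = np.where(pirate == 1, p, 1.0 - p)
    gain = np.sqrt((1.0 - q) / q)
    loss = -np.sqrt(q / (1.0 - q))
    return np.where(codes == pirate, gain, loss).sum(axis=1)

scores = accusation_scores(codes, pirate, p)
# colluders should stand out with much higher scores than innocent users
```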
{"title":"URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection","authors":"Wei Luo;Peng Xing;Yunkang Cao;Haiming Yao;Weiming Shen;Zechao Li","doi":"10.1109/TCSVT.2025.3602391","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3602391","url":null,"abstract":"Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To help URA-Net learn to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. 
The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2464-2477"},"PeriodicalIF":11.1,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
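The final scoring step described above — residual maps between input and restored features — is a per-location distance, as in this small sketch where the shapes and the injected defect are illustrative:

```python
import numpy as np

def anomaly_map(feat_in, feat_restored):
    # per-location anomaly score: channel-wise L2 distance between the
    # input features and their restored (defect-free) counterparts
    return np.linalg.norm(feat_in - feat_restored, axis=0)

C, H, W = 16, 8, 8
rng = np.random.default_rng(3)
feat = rng.normal(size=(C, H, W))
restored = feat.copy()
restored[:, 2:4, 2:4] -= 1.5       # pretend a defect was "restored away" here
amap = anomaly_map(feat, restored)
image_score = amap.max()           # image-level anomaly score
```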
{"title":"Deep Network-Based Adaptive Quantization for Practical Video Coding","authors":"Shuai Huo;Hewei Liu;Jiawen Gu;Dengchao Jin;Meng Lei;Bo Huang;Chao Zhou","doi":"10.1109/TCSVT.2025.3601718","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3601718","url":null,"abstract":"The optimization of block-level quantization parameters (QP) is critical to improving the performance of practical block-based video compression encoders, but the extremely large optimization space makes it challenging to solve. Existing solutions, e.g., the HEVC encoder x265, usually impose optimization constraints such as a block-independence assumption and a linear distortion propagation model, which limit compression efficiency improvement to a certain extent. To address this problem, a deep learning-based encoder-only adaptive quantization method (DAQ) is proposed in this paper, where a deep network is designed to adaptively model the joint temporal propagation relationship of quantization among blocks. Specifically, DAQ consists of two phases: in the training phase, considering the heavy search cost of the traditional codec, we introduce a well-designed end-to-end learned block-based video compression network as an effective training proxy for the deep encoder-side network. In the deployment phase, the trained deep network is applied to jointly predict all block QPs in a frame for the traditional encoder. Besides, our network is deployed only on the encoder side without changing the standard decoder and has very low inference complexity, making it practical to apply. Finally, we deploy DAQ in HEVC and VVC encoders for performance comparison, and the experimental results demonstrate that DAQ significantly outperforms the practically used x265 with average BD-rate reductions of 15.0% and 10.9% under SSIM and PSNR, respectively, and also achieves 12.5% and 5.0% coding gains over VTM. 
Moreover, for deploying deep video codecs in practice, this work provides new insight into optimizing encoder parameters over a large search space.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2538-2550"},"PeriodicalIF":11.1,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
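For readers less familiar with block-level quantization: in HEVC/VVC-style encoders the QP assigned to a block fixes the Lagrange multiplier of its rate-distortion decisions via a relation of roughly the following form, which is why a per-block QP map directly steers the encoder's trade-offs. The constant below follows a common HEVC convention but varies with configuration and slice type:

```python
def qp_to_lambda(qp: float, c: float = 0.57) -> float:
    # HEVC-style mapping from QP to the Lagrange multiplier lambda used in
    # the rate-distortion cost J = D + lambda * R
    return c * 2.0 ** ((qp - 12) / 3.0)

lam22, lam28 = qp_to_lambda(22), qp_to_lambda(28)   # +6 QP ~ 4x lambda
```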
{"title":"LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking With Point Clouds","authors":"Zhenrong Zhang;Jianan Liu;Yuxuan Xia;Tao Huang;Qing-Long Han;Hongbin Liu","doi":"10.1109/TCSVT.2025.3600881","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3600881","url":null,"abstract":"Online Multi-Object Tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, in which data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve upon the data association performance of existing methods. The proposed LEGO tracker integrates graph optimization, which efficiently formulates the association score map, facilitating the accurate and efficient matching of objects across time frames. A Kalman filter is added to further enhance the state update process, ensuring consistent tracking by incorporating temporal coherence into the object states. Our proposed method, using LiDAR alone, has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked <inline-formula> <tex-math>$3^{rd}$ </tex-math></inline-formula> among all trackers (both online and offline) and <inline-formula> <tex-math>$2^{nd}$ </tex-math></inline-formula> among all online trackers in the KITTI MOT benchmark for cars (<uri>https://www.cvlibs.net/datasets/kitti/eval_tracking.php</uri>) at the time of submitting results to the KITTI object tracking evaluation board. 
Moreover, our method achieves competitive performance on the Waymo open dataset benchmark.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2419-2432"},"PeriodicalIF":11.1,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
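The Kalman-filter state update mentioned above is the textbook predict/correct cycle. A self-contained 1-D constant-velocity sketch (the tracker itself maintains 3-D box states; this is only the scalar skeleton):

```python
import numpy as np

def kf_predict(x, P, F, Q):
    # propagate state and covariance through the motion model
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    # correct the prediction with a new detection z
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model (1-D)
H = np.array([[1.0, 0.0]])              # detections observe position only
Q = np.eye(2) * 1e-4
R = np.array([[1e-2]])

x, P = np.array([0.0, 0.0]), np.eye(2)
for t in range(1, 50):
    z = np.array([t * dt])              # object moving at 1 unit/s
    x, P = kf_predict(x, P, F, Q)
    x, P = kf_update(x, P, z, H, R)
```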
{"title":"Mining Temporal Priors for Template-Generated Video Compression","authors":"Feng Xing;Yingwen Zhang;Meng Wang;Hengyu Man;Yongbing Zhang;Shiqi Wang;Xiaopeng Fan;Wen Gao","doi":"10.1109/TCSVT.2025.3599239","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599239","url":null,"abstract":"The popularity of template-generated videos has recently experienced a significant increase on social media platforms. In general, videos from the same template share similar temporal characteristics, which are unfortunately ignored in current compression schemes. In view of this, we aim to examine how such temporal priors from templates can be effectively utilized during the compression process for template-generated videos. First, a comprehensive statistical analysis is conducted, revealing that the coding decisions, including the merge, non-affine, and motion information, across template-generated videos are strongly correlated. Subsequently, leveraging such correlations as prior knowledge, a simple yet effective prior-driven compression scheme for template-generated videos is proposed. In particular, a mode decision pruning algorithm is devised to dynamically skip unnecessary advanced motion vector prediction (AMVP) or affine AMVP decisions. Moreover, an improved AMVP motion estimation algorithm is applied to further accelerate reference frame selection and the motion estimation process. 
Experimental results on the versatile video coding (VVC) platform VTM-23.0 demonstrate that the proposed scheme achieves moderate time reductions of 14.31% and 14.99% under the Low-Delay P (LDP) and Low-Delay B (LDB) configurations, respectively, while maintaining negligible increases in Bjøntegaard Delta Rate (BD-Rate) of 0.15% and 0.18%, respectively.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1160-1172"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
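The mode-decision pruning reduces to a gate that skips the costly AMVP/affine-AMVP search when the template prior says co-located blocks almost always chose merge mode; the threshold and statistics below are hypothetical placeholders, not the paper's learned rule:

```python
def should_skip_amvp(prior_merge_rate: float, threshold: float = 0.9) -> bool:
    # gate the costly AMVP / affine-AMVP search: skip it when co-located
    # blocks in same-template videos overwhelmingly chose merge mode
    return prior_merge_rate >= threshold

# hypothetical merge-mode rates collected from videos of the same template
priors = [0.97, 0.55, 0.92, 0.30]
decisions = [should_skip_amvp(r) for r in priors]
```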
{"title":"An End-to-End Framework for Joint Makeup Style Transfer and Image Steganography","authors":"Meihong Yang;Ziyi Feng;Bin Ma;Jian Xu;Yongjin Xian;Linna Zhou","doi":"10.1109/TCSVT.2025.3599551","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599551","url":null,"abstract":"Existing image steganography schemes typically introduce obvious modification traces to the cover image, resulting in the risk of secret information leakage. To address this issue, an end-to-end framework for joint makeup style transfer and image steganography is proposed in this paper to achieve imperceptible higher-capacity data hiding. In the scheme, a Parsing-guided Semantic Feature Alignment (PSFA) module is designed to transfer the style of a makeup image to a target non-makeup image, thereby generating a content-style integrated feature matrix. Meanwhile, a Multi-Scale Feature Fusion and Data Embedding (MFFDE) module is devised to encode the secret image into its latent features and fuse them with the generated content-style integrated feature matrix, as well as the non-makeup image features across multiple scales, to produce the makeup-stego image. As a result, the style of the makeup image is well transferred and the secret image is imperceptibly embedded simultaneously, without directly modifying the pixels of the original non-makeup image. Additionally, a Residual-aware Information Compensation Network (RICN) is developed to compensate for the loss of the secret image arising from the multilevel data embedding, thereby further enhancing the quality of the reconstructed secret image. 
Experimental results show that the proposed scheme achieves superior steganalysis resistance and visual quality in both makeup-stego images and recovered secret images, compared with other state-of-the-art schemes.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 1","pages":"1293-1308"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146049267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking","authors":"You Wu;Yongxin Li;Mengyuan Liu;Xucheng Wang;Xiangyang Yang;Hengzhou Ye;Dan Zeng;Qijun Zhao;Shuiwang Li","doi":"10.1109/TCSVT.2025.3599856","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3599856","url":null,"abstract":"Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to that of AVTrack while reducing model complexity and boosting average tracking speed by over 17%. 
Code is available at <uri>https://github.com/wuyou3474/AVTrack</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 2","pages":"2403-2418"},"PeriodicalIF":11.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
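The multi-teacher aggregation can be sketched as follows. Note the paper maximizes mutual information between aggregated softened teacher features and the student's; this simplified sketch substitutes the more common averaged-soft-target cross-entropy, so treat it as an illustration of the data flow rather than the paper's loss:

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-softened softmax (higher T -> softer distribution)
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_target(teacher_logits, T=4.0):
    # aggregate the softened distributions of several teachers by averaging
    return np.mean([softmax(t, T) for t in teacher_logits], axis=0)

def distill_loss(student_logits, target, T=4.0):
    # cross-entropy between aggregated teacher target and softened student
    s = softmax(student_logits, T)
    return float(-np.sum(target * np.log(s + 1e-12)))

teachers = [np.array([2.0, 0.5, -1.0]), np.array([1.5, 0.8, -0.5])]
target = multi_teacher_target(teachers)
good = distill_loss(np.array([1.8, 0.6, -0.8]), target)  # agrees with teachers
bad = distill_loss(np.array([-2.0, 0.0, 2.0]), target)   # contradicts them
```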