{"title":"StegMamba: Distortion-Free Immune-Cover for Multi-Image Steganography With State Space Model","authors":"Ting Luo;Yuhang Zhou;Zhouyan He;Gangyi Jiang;Haiyong Xu;Shuren Qi;Yushu Zhang","doi":"10.1109/TCSVT.2024.3515652","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3515652","url":null,"abstract":"Multi-image steganography ensures privacy protection while avoiding suspicion from third parties by embedding multiple secret images within a cover image. However, existing multi-image steganographic methods fail to model global spatial correlations to reduce image damage at the low computation cost. Moreover, they do not account for the anti-distortion capability of the cover image, which is crucial for achieving imperceptible and ensuring security. To overcome these limitations, we propose StegMamba, a distortion-free immune-cover for multi-image steganography architecture with a state space model. Specifically, we first explore the potential of the linear computational cost model Mamba for data hiding tasks through a steganography Mamba block (SMB), whose efficiency makes it suitable for real-time applications. Subsequently, considering that images with distortion resistance reduce embedding damage, the original cover image is reconstructed through immune-cover construction module (ICCM) and associated with the steganography task. Moreover, well-coupled features facilitate fusion, and thus a wavelet-based interaction module (WIM) is designed for effective communication between the immune-cover and the secret images. Compared with the state-of-the-art global attention-based methods, the proposed StegMamba obtains PSNR gains of 3.30 dB, 1.37 dB, and 1.92 dB for the stego image, and two secret recovery images, respectively, and the reduction of 2.87% in detection accuracy for anti-steganalysis. This code is available at <uri>https://github.com/YuhangZhouCJY/StegMamba</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4576-4591"},"PeriodicalIF":8.3,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explicitly Disentangling and Exclusively Fusing for Semi-Supervised Bi-Modal Salient Object Detection","authors":"Jie Wang;Xiangji Kong;Nana Yu;Zihao Zhang;Yahong Han","doi":"10.1109/TCSVT.2024.3514897","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3514897","url":null,"abstract":"Bi-modal (RGB-T and RGB-D) salient object detection (SOD) aims to enhance detection performance by leveraging the complementary information between modalities. While significant progress has been made, two major limitations persist. Firstly, mainstream fully supervised methods come with a substantial burden of manual annotation, while weakly supervised or unsupervised methods struggle to achieve satisfactory performance. Secondly, the indiscriminate modeling of local detailed information (object edge) and global contextual information (object body) often results in predicted objects with incomplete edges or inconsistent internal representations. In this work, we propose a novel paradigm to effectively alleviate the above limitations. Specifically, we first enhance the consistency regularization strategy to build a basic semi-supervised architecture for the bi-modal SOD task, which ensures that the model can benefit from massive unlabeled samples while effectively alleviating the annotation burden. Secondly, to ensure detection performance (i.e., complete edges and consistent bodies), we disentangle the SOD task into two parallel sub-tasks: edge integrity fusion prediction and body consistency fusion prediction. Achieving these tasks involves two key steps: 1) the explicitly disentangling scheme decouples salient object features into edge and body features, and 2) the exclusively fusing scheme performs exclusive integrity or consistency fusion for each of them. Eventually, our approach demonstrates significant competitiveness compared to 26 fully supervised methods, while effectively alleviating 90% of the annotation burden. Furthermore, it holds a substantial advantage over 15 non-fully supervised methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4479-4492"},"PeriodicalIF":8.3,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty-Aware Label Refinement on Hypergraphs for Personalized Federated Facial Expression Recognition","authors":"Hu Ding;Yan Yan;Yang Lu;Jing-Hao Xue;Hanzi Wang","doi":"10.1109/TCSVT.2024.3513973","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3513973","url":null,"abstract":"Most facial expression recognition (FER) models are trained on large-scale expression data with centralized learning. Unfortunately, collecting a large amount of centralized expression data is difficult in practice due to privacy concerns of facial images. In this paper, we investigate FER under the framework of personalized federated learning, which is a valuable and practical decentralized setting for real-world applications. To this end, we develop a novel uncertainty-Aware label refineMent on hYpergraphs (AMY) method. For local training, each local model consists of a backbone, an uncertainty estimation (UE) block, and an expression classification (EC) block. In the UE block, we leverage a hypergraph to model complex high-order relationships between expression samples and incorporate these relationships into uncertainty features. A personalized uncertainty estimator is then introduced to estimate reliable uncertainty weights of samples in the local client. In the EC block, we perform label propagation on the hypergraph, obtaining high-quality refined labels for retraining an expression classifier. Based on the above, we effectively alleviate heterogeneous sample uncertainty across clients and learn a robust personalized FER model in each client. Experimental results on two challenging real-world facial expression databases show that our proposed method consistently outperforms several state-of-the-art methods. This indicates the superiority of hypergraph modeling for uncertainty estimation and label refinement on the personalized federated FER task. The source code will be released at <uri>https://github.com/mobei1006/AMY</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4675-4685"},"PeriodicalIF":8.3,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictive Sample Assignment for Semantically Coherent Out-of-Distribution Detection","authors":"Zhimao Peng;Enguang Wang;Xialei Liu;Ming-Ming Cheng","doi":"10.1109/TCSVT.2024.3514312","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3514312","url":null,"abstract":"Semantically coherent out-of-distribution detection (SCOOD) is a recently proposed realistic OOD detection setting: given labeled in-distribution (ID) data and mixed in-distribution and out-of-distribution unlabeled data as the training data, SCOOD aims to enable the trained model to accurately identify OOD samples in the testing data. Current SCOOD methods mainly adopt various clustering-based in-distribution sample filtering (IDF) strategies to select clean ID samples from unlabeled data, and take the remaining samples as auxiliary OOD data, which inevitably introduces a large number of noisy samples in training. To address the above issue, we propose a concise SCOOD framework based on predictive sample assignment (PSA). PSA includes a dual-threshold ternary sample assignment strategy based on the predictive energy score that can significantly improve the purity of the selected ID and OOD sample sets by assigning unconfident unlabeled data to an additional discard sample set, and a concept contrastive representation learning loss to further expand the distance between ID and OOD samples in the representation space to assist ID/OOD discrimination. In addition, we also introduce a retraining strategy to help the model fully fit the selected auxiliary ID/OOD samples. Experiments on two standard SCOOD benchmarks demonstrate that our approach outperforms the state-of-the-art methods by a significant margin. The code is available at: <uri>https://github.com/ZhimaoPeng/PSA</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4686-4697"},"PeriodicalIF":8.3,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Underwater Image Restoration Method With Polarization Imaging Optimization Model for Poor Visible Conditions","authors":"Yafeng Li;Yuehan Chen;Jiqing Zhang;Yudong Li;Xianping Fu","doi":"10.1109/TCSVT.2024.3512600","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3512600","url":null,"abstract":"Polarization imaging is extensively employed in underwater image restoration due to its effectiveness in removing backscattered light. However, existing polarization imaging methods generally assume the degree of polarization (DoP) of the backscattering is spatially constant and estimate it from the background region, limiting their practical applications. To address these challenges, we propose an underwater image restoration method based on a polarization imaging optimization model (PIOM). First, we develop a novel polarization image formation model by fusing the DoP and angle of polarization (AoP) of backscattered light. Second, we introduce an adaptive particle swarm local optimization (APSLO) method based on the PIOM. This method decomposes the image into small blocks and employs an objective optimization function to estimate the local optimal fusion parameters. Additionally, we propose a robust polynomial spatial fitting method to reduce block artifacts and noise disturbances, achieving globally optimal fusion parameters. Finally, we fully consider the advantages of gamma correction, and propose an adaptive contrast enhancement method to balance brightness and contrast. Experimental results show that our PIOM effectively removes backscattering while preserving finer details, colors, and contours. The code and datasets will be available at <uri>https://github.com/liyafengLYF/UIRPIOM</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3924-3939"},"PeriodicalIF":8.3,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Equivariance-Based Markov Decision Process for Unsupervised Point Cloud Registration","authors":"Yue Wu;Jiayi Lei;Yongzhe Yuan;Xiaolong Fan;Maoguo Gong;Wenping Ma;Qiguang Miao;Mingyang Zhang","doi":"10.1109/TCSVT.2024.3512858","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3512858","url":null,"abstract":"Unsupervised point cloud registration is crucial in 3D computer vision. However, most unsupervised methods struggle to construct effective optimization objectives and reliable unsupervised signals to enhance the performance of the model. To address these issues, with the observation of the significant alignment between the registration process and the Markov Decision Process (MDP), we model point cloud registration as MDP, which can provide more reliable unsupervised signals through the reward. We propose a colored noise based cross-entropy method, which introduces colored noise into sampling process, regulating the power spectral density of the action sequence and expanding the search space, improving the registration effect. Particularly, to strengthen constraints on MDP and training in the transformation space, we utilize equivariance theory to construct transformation equivariant constraint as a new optimization objective and derive equivariant constraint solutions for optimization, providing more reliable unsupervised signals. Extensive experiments demonstrate the superior performance of our method on benchmark datasets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4648-4660"},"PeriodicalIF":8.3,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Video Moment Retrieval by Auxiliary Moment-Query Pairs With Hyper-Interaction","authors":"Runhao Zeng;Yishen Zhuo;Jialiang Li;Yunjin Yang;Huisi Wu;Qi Chen;Xiping Hu;Victor C. M. Leung","doi":"10.1109/TCSVT.2024.3513633","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3513633","url":null,"abstract":"Most existing video moment retrieval (VMR) benchmark datasets face a common issue of sparse annotations-only a few moments being annotated. We argue that videos contain a broader range of meaningful moments that, if leveraged, could significantly enhance performance. Existing methods typically follow a generate-then-select paradigm, focusing primarily on generating moment-query pairs while neglecting the crucial aspect of selection. In this paper, we propose a new method, HyperAux, to yield auxiliary moment-query pairs by modeling the multi-modal hyper-interaction between video and language. Specifically, given a set of candidate moment-query pairs from a video, we construct a hypergraph with multiple hyperedges, each corresponding to a moment-query pair. Unlike traditional graphs where each edge connects only two nodes (frames or queries), each hyperedge connects multiple nodes, including all frames within a moment, semantically related frames outside the moment, and an input query. This design allows us to consider the frames within a moment as a whole, rather than modeling individual frame-query relationships separately. More importantly, constructing the relationships among all moment-query pairs within a video into a large hypergraph facilitates selecting higher-quality data from such pairs. On this hypergraph, we employ a hypergraph neural network to aggregate node information, update the hyperedge, and propagate video-language hyper-interactions to each connected node, resulting in context-aware node representations. This enables us to use node relevance to select high-quality moment-query pairs and refine the moments’ boundaries. We also exploit the discrepancy in semantic matching within and outside moments to construct a loss function for training the HGNN without human annotations. Our auxiliary data enhances the performance of twelve VMR models under fully-supervised, weakly-supervised, and zero-shot settings across three widely used VMR datasets: ActivityNet Captions, Charades-STA, and QVHighlights. We will release the source code and models publicly.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3940-3954"},"PeriodicalIF":8.3,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective Global Context Integration for Lightweight 3D Medical Image Segmentation","authors":"Qiang Qiao;Meixia Qu;Wenyu Wang;Bin Jiang;Qiang Guo","doi":"10.1109/TCSVT.2024.3511926","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3511926","url":null,"abstract":"Accurate and fast segmentation of 3D medical images is crucial in clinical analysis. CNNs struggle to capture long-range dependencies because of their inductive biases, whereas the Transformer can capture global features but faces a considerable computational burden. Thus, efficiently integrating global and detailed insights is key for precise segmentation. In this paper, we propose an effective and lightweight architecture named GCI-Net to address this issue. The key characteristic of GCI-Net is the global-guided feature enhancement strategy (GFES), which integrates the global context and facilitates the learning of local information; 3D convolutional attention, which captures long-range dependencies; and a progressive downsampling module, which perceives detailed information better. The GFES can capture the local range of information through global-guided feature fusion and global-local contrastive loss. All these designs collectively contribute to lower computational complexity and reliable performance improvements. The proposed model is trained and tested on four public datasets, namely MSD Brain Tumor, ACDC, BraTS2021, and MSD Lung. The experimental results show that, compared with several recent SOTA methods, our GCI-Net achieves superior computational efficiency with comparable or even better segmentation performance. The code is available at <uri>https://github.com/qintianjian-lab/GCI-Net</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4661-4674"},"PeriodicalIF":8.3,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LPFSformer: Location Prior Guided Frequency and Spatial Interactive Learning for Nighttime Flare Removal","authors":"Guang-Yong Chen;Wei Dong;Guodong Fan;Jian-Nan Su;Min Gan;C. L. Philip Chen","doi":"10.1109/TCSVT.2024.3510925","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3510925","url":null,"abstract":"When capturing images under strong light sources at night, intense lens flare artifacts often appear, significantly degrading visual quality and impacting downstream computer vision tasks. Although transformer-based methods have achieved remarkable results in nighttime flare removal, they fail to adequately distinguish between flare and non-flare regions. This unified processing overlooks the unique characteristics of these regions, leading to suboptimal performance and unsatisfactory results in real-world scenarios. To address this critical issue, we propose a novel approach incorporating Location Prior Guidance (LPG) and a specialized flare removal model, LPFSformer. LPG is designed to accurately learn the location of flares within an image and effectively capture the associated glow effects. By employing Location Prior Injection (LPI), our method directs the model’s focus towards flare regions through the interaction of frequency and spatial domains. Additionally, to enhance the recovery of high-frequency textures and capture finer local details, we designed a Global Hybrid Feature Compensator (GHFC). GHFC aggregates different expert structures, leveraging the diverse receptive fields and CNN operations of each expert to effectively utilize a broader range of features during the flare removal process. Extensive experiments demonstrate that our LPFSformer achieves state-of-the-art flare removal performance compared to existing methods. Our code and a pre-trained LPFSformer have been uploaded to GitHub for validation.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 4","pages":"3706-3718"},"PeriodicalIF":8.3,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143783317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query as Supervision: Toward Low-Cost and Robust Video Moment and Highlight Retrieval","authors":"Xun Jiang;Liqing Zhu;Xing Xu;Fumin Shen;Yang Yang;Heng Tao Shen","doi":"10.1109/TCSVT.2024.3510950","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3510950","url":null,"abstract":"Video Moment and Highlight Retrieval (VMHR) aims at retrieving video events with a text query in a long untrimmed video and selecting the most related video highlights by assigning the worthiness scores. However, we observed existing methods mostly have two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-cost and subjective, thus it is very hard and expensive to gather qualified annotated training data. 2) The previous VMHR methods would fit the temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm on model robustness towards biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles the annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method completely learns multimodal alignments within semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it only requires low-cost annotations and also provides much better robustness towards Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA and their biased training version. Extensive experiments demonstrate that the QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and reveals better robustness against biased training data. Our code is available at <uri>https://github.com/CFM-MSG/Code_QaS</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3955-3968"},"PeriodicalIF":8.3,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}