Image and Vision Computing: Latest Articles

Spatial cascaded clustering and weighted memory for unsupervised person re-identification
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-03-03 DOI: 10.1016/j.imavis.2025.105478
Jiahao Hong, Jialong Zuo, Chuchu Han, Ruochen Zheng, Ming Tian, Changxin Gao, Nong Sang
Abstract: Recent advancements in unsupervised person re-identification (re-ID) have demonstrated high performance by leveraging fine-grained local context; such approaches are often referred to as part-based methods. However, many existing part-based methods rely on horizontal division to obtain local contexts, leading to misalignment issues caused by varied human poses. Moreover, misalignment of semantic information within part features hampers the effectiveness of metric learning, limiting the potential of part-based methods. These challenges result in the under-utilization of part features in existing approaches. To address these issues, we introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method. SCWM parses and aligns more accurate local contexts for different human body parts while allowing the memory module to balance hard-example mining and noise suppression. Specifically, we first analyze the issues of foreground omission and spatial confusion in previous methods. We then propose foreground and space corrections to enhance the completeness and reasonableness of human parsing results. Next, we introduce a weighted memory with two weighting strategies: one addresses hard-sample mining for global features, and the other enhances noise resistance for part features, enabling better utilization of both global and part features. Extensive experiments on the Market-1501, DukeMTMC-reID, and MSMT17 datasets validate the effectiveness of the proposed method against numerous state-of-the-art methods.
Volume 156, Article 105478.
Citations: 0
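The abstract does not give the memory update rule; in cluster-memory re-ID pipelines a common design keeps one centroid per pseudo-label cluster, updates it with momentum, and scores each query against all centroids with an InfoNCE-style objective, optionally scaling the loss for hard samples. A minimal NumPy sketch under those assumptions (the function names and the `weight` hook are hypothetical, not taken from the paper):

```python
import numpy as np

def update_memory(memory, feat, label, momentum=0.9):
    """Momentum update of one cluster centroid; centroids stay unit-length."""
    memory[label] = momentum * memory[label] + (1.0 - momentum) * feat
    memory[label] /= np.linalg.norm(memory[label])
    return memory

def memory_contrastive_loss(memory, feat, label, weight=1.0, temp=0.05):
    """InfoNCE against all centroids; `weight` is where a hard-example
    weighting strategy could scale the per-sample loss."""
    logits = memory @ feat / temp
    log_prob = logits - np.log(np.exp(logits).sum())
    return -weight * log_prob[label]
```

In a full pipeline the pseudo-labels come from clustering, and the memory is re-initialized from cluster means after each clustering round.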
Video Wire Inpainting via Hierarchical Feature Mixture
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-03-03 DOI: 10.1016/j.imavis.2025.105460
Zhong Ji, Yimu Su, Yan Zhang, Shuangming Yang, Yanwei Pang
Abstract: Video wire inpainting aims to automatically eliminate visible wires from film footage, significantly streamlining post-production workflows. Previous models address redundancy in wire removal by eliminating redundant blocks to sharpen focus on crucial wire details for more accurate reconstruction. However, once redundancy is removed, the disorganized non-redundant blocks disrupt temporal and spatial coherence, making seamless inpainting challenging. The absence of multi-scale feature fusion further limits the model's ability to handle different wire scales and to blend inpainted regions with complex backgrounds. To address these challenges, we propose a Hierarchical Feature Mixture Network (HFM-Net) that integrates two novel modules: a Hierarchical Transformer Module (HTM) and a Spatio-temporal Feature Mixture Module (SFM). Specifically, the HTM employs redundancy-aware attention modules and lightweight transformers to reorganize and fuse key high- and low-dimensional patches; lightweight transformers suffice because only the reduced set of non-redundant blocks needs processing. By aggregating similar features, these transformers guide the alignment of non-redundant blocks and achieve effective spatio-temporal synchronization. Building on this, the SFM incorporates gated convolutions and a GRU to further enhance spatial and temporal integration: gated convolutions fuse low- and high-dimensional features, while the GRU captures temporal dependencies, enabling seamless inpainting of dynamic wire patterns. Additionally, we introduce a lightweight 3D separable convolution discriminator that improves video quality during inpainting while reducing computational costs. Experimental results demonstrate that HFM-Net achieves state-of-the-art performance on the video wire removal task.
Volume 157, Article 105460.
Citations: 0
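The gating mechanic behind the SFM's gated convolutions can be illustrated without the full network: one branch produces candidate features, a parallel branch produces a sigmoid mask in (0, 1) that decides, per element, how much of the fused signal passes. A NumPy sketch with dense matrices standing in for the convolutions (shapes and names are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(low, high, w_feat, w_gate):
    """Fuse low- and high-dimensional features through a learned soft gate."""
    x = np.concatenate([low, high], axis=-1)
    feat = np.tanh(x @ w_feat)   # candidate features in (-1, 1)
    gate = sigmoid(x @ w_gate)   # soft mask in (0, 1)
    return feat * gate           # gate suppresses unreliable activations
```

In the real module both `w_feat` and `w_gate` would be convolution kernels learned jointly with the rest of the network.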
Real-time localization and navigation method for autonomous vehicles based on multi-modal data fusion by integrating memory transformer and DDQN
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-03-02 DOI: 10.1016/j.imavis.2025.105484
Li Zha, Chen Gong, Kunfeng Lv
Abstract: In the field of autonomous driving, real-time localization and navigation are the core technologies that ensure vehicle safety and precise operation. With advancements in sensor technology and computing power, multi-modal data fusion has become a key method for enhancing the environmental perception capabilities of autonomous vehicles. This study explores a novel visual-language navigation technology to achieve precise navigation of autonomous cars in complex environments. By integrating information from radar, sonar, 5G networks, Wi-Fi, Bluetooth, and a 360-degree visual information collection device mounted on the vehicle's roof, the model fully exploits rich multi-source data. It uses a Memory Transformer for efficient data encoding and a self-attention fusion strategy, balancing feature integrity against real-time performance. The encoded data is then fed into a DDQN navigation algorithm built on an automatically growing environmental target knowledge graph and large-scale scene maps, enabling continuous learning and optimization in real-world environments. Comparative experiments show that the proposed model outperforms existing SOTA models, benefiting in particular from the macro-spatial reference of large-scale scene maps, the background knowledge supplied by the automatically growing knowledge graph, and the experience-optimized navigation strategies of the DDQN algorithm. Against the SOTA models, the proposed model achieved scores of 3.99, 0.65, 0.67, 0.65, 0.63, and 0.63 on the six metrics NE, SR, OSR, SPL, CLS, and DTW, respectively. These results significantly enhance the intelligent positioning and navigation capabilities of autonomous vehicles.
Volume 156, Article 105484.
Citations: 0
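The DDQN component itself is standard: the online network selects the greedy action at the next state, while a periodically synced target network evaluates it, which curbs the overestimation bias of vanilla Q-learning. A table-based NumPy sketch of the target computation (generic rule only; the paper's state encoding and knowledge-graph machinery are not reproduced here):

```python
import numpy as np

def ddqn_target(reward, next_state, q_online, q_target, gamma=0.99, done=False):
    """Double DQN target: select with the online net, evaluate with the target net."""
    if done:
        return reward                                      # terminal: no bootstrap
    a_star = int(np.argmax(q_online[next_state]))          # online net picks action
    return reward + gamma * q_target[next_state, a_star]  # target net scores it
```

During training, `q_online` is the network being updated toward this target, and `q_target` is a delayed copy of its weights.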
Multi-information guided camouflaged object detection
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-03-01 DOI: 10.1016/j.imavis.2025.105470
Caijuan Shi, Lin Zhao, Rui Wang, Kun Zhang, Fanyue Kong, Changyu Duan
Abstract: Camouflaged Object Detection (COD) aims to identify objects hidden in the background environment. Although more and more COD methods have been proposed in recent years, existing methods still perform poorly on small objects, obscured objects, boundary-rich objects, and multiple objects, mainly because they fail to utilize context, texture, and boundary information simultaneously and effectively. Therefore, in this paper we propose a Multi-information Guided Camouflaged Object Detection Network (MIGNet) that fully exploits context, texture, and boundary information to boost detection performance. Specifically, we first design texture and boundary labels together with a Texture and Boundary Enhanced Module (TBEM) to obtain differentiated texture and boundary information. Next, a Neighbor Context Information Exploration Module (NCIEM) is designed to obtain rich multi-scale context information. Then, a Parallel Group Bootstrap Module (PGBM) is designed to maximize the effective aggregation of context, texture, and boundary information. Finally, an Information Enhanced Decoder (IED) is designed to enhance the interaction of neighboring-layer features and suppress background noise. Extensive quantitative and qualitative experiments on four widely used datasets indicate that MIGNet outperforms 22 other COD models.
Volume 156, Article 105470.
Citations: 0
CMS-net: Edge-aware multimodal MRI feature fusion for brain tumor segmentation
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-28 DOI: 10.1016/j.imavis.2025.105481
Chunjie Lv, Biyuan Li, Xiuwei Wang, Pengfei Cai, Bo Yang, Xuefeng Jia, Jun Yan
Abstract: With the growing application of artificial intelligence in medical image processing, multimodal MRI brain tumor segmentation has become crucial for clinical diagnosis and treatment. Accurate segmentation relies heavily on the effective utilization of multimodal information. However, most existing methods focus primarily on global and local deep semantic features, often overlooking critical aspects such as edge information and cross-channel correlations. To address these limitations while retaining the strengths of existing methods, we propose CMS-Net, an edge-aware feature-fusion model built on a dual-encoder architecture that integrates edge-aware fusion, cross-channel interaction, and spatial state feature extraction to fully leverage multimodal information. The architecture comprises an encoder and a decoder. The encoder combines convolutional downsampling with Smart Swin Transformer downsampling, the latter employing Shifted Spatial Multi-Head Self-Attention (SSW-MSA) to capture global features and enhance long-range dependencies. The decoder reconstructs the image via the CMS-Block, which consists of three key modules: a Multi-Scale Deep Convolutional Cross-Channel Attention module (MDTA), a Spatial State Module (SSM), and a Boundary-Aware Feature Fusion module (SWA). MDTA generates attention maps through cross-channel covariance, while the SSM models spatial context to improve the understanding of complex structures. The SWA module, combining SSW-MSA with pooling, subtraction, and convolution, facilitates feature fusion and edge extraction. Dice and Focal loss functions are introduced to optimize cross-channel and spatial feature extraction. Experimental results on the BraTS2018, BraTS2019, and BraTS2020 datasets demonstrate that CMS-Net effectively integrates spatial state, cross-channel, and boundary information, significantly improving multimodal brain tumor segmentation accuracy.
Volume 156, Article 105481.
Citations: 0
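The abstract names Dice and Focal losses but not their weighting. Both have standard forms: Dice penalizes poor region overlap, while Focal is cross-entropy scaled by a (1 - p_t)**gamma factor that down-weights easy pixels. A NumPy sketch for binary masks (the equal 0.5/0.5 weighting is an assumption, not from the paper):

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """1 - Dice coefficient: penalizes poor region overlap."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def focal_loss(prob, target, gamma=2.0, eps=1e-6):
    """Cross-entropy scaled by (1 - p_t)**gamma to focus on hard pixels."""
    p_t = np.where(target == 1, prob, 1.0 - prob)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))

def seg_loss(prob, target, w_dice=0.5, w_focal=0.5):
    # equal weighting is an assumption; in practice the mix is usually tuned
    return w_dice * dice_loss(prob, target) + w_focal * focal_loss(prob, target)
```

For multi-class tumor sub-regions the same losses are typically averaged over one-hot channels.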
Joint Transformer and Mamba fusion for multispectral object detection
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-27 DOI: 10.1016/j.imavis.2025.105468
Chao Li, Xiaoming Peng
Abstract: Multispectral object detection is generally considered superior to single-modality object detection, owing to the complementary properties of multispectral image pairs. However, how to integrate features from images of different modalities for object detection remains an open problem. In this paper, we propose a new multispectral object detection framework based on the Transformer and Mamba architectures, called joint Transformer and Mamba detection (JTMDet). Specifically, we divide the feature fusion process into two stages, intra-scale fusion and inter-scale fusion, to comprehensively utilize multi-modal features at different scales. To this end, we designed the cross-modal fusion (CMF) and cross-level fusion (CLF) modules, both of which contain JTMBlock modules. A JTMBlock interleaves Transformer and Mamba layers to robustly capture the useful information in multispectral image pairs while maintaining high inference speed. Extensive experiments on three publicly available datasets show that JTMDet achieves state-of-the-art multispectral object detection performance and is competitive with current leading methods. Code and pre-trained models are publicly available at https://github.com/LiC2023/JTMDet.
Volume 156, Article 105468.
Citations: 0
A spatial-frequency domain multi-branch decoder method for real-time semantic segmentation
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-26 DOI: 10.1016/j.imavis.2025.105483
Liwei Deng, Boda Wu, Songyu Chen, Dongxue Li, Yanze Fang
Abstract: Semantic segmentation is crucial to autonomous driving systems. However, most existing real-time semantic segmentation models focus on encoder design and under-utilize spatial- and frequency-domain information in the decoder, limiting segmentation accuracy. To solve this problem, this paper proposes a multi-branch decoder network combining the spatial and frequency domains to meet the real-time and accuracy requirements of road-scene semantic segmentation for autonomous driving. First, the network introduces a novel multi-scale dilated fusion block that gradually enlarges the receptive field through three consecutive dilated convolutions and integrates features from different levels using skip connections, while a strategy of gradually reducing the number of channels effectively removes redundant features. Second, we design three branches for the decoder. The global branch uses a lightweight Transformer architecture to extract global features and employs horizontal and vertical convolutions for interaction among them. The multi-scale branch combines dilated convolution and adaptive pooling to perform multi-scale feature extraction through fusion and post-processing. The wavelet-transform feature converter maps spatial-domain features into low- and high-frequency components, which are then fused with the global and multi-scale features to enhance the model's representation. Finally, experimental results on multiple datasets show that the proposed method best balances segmentation accuracy and inference speed.
Volume 156, Article 105483.
Citations: 0
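The claim that three consecutive dilated convolutions "gradually enlarge the receptive field" is easy to quantify: at stride 1, each kernel of size k with dilation d adds (k - 1) * d input positions to the receptive field. A small sketch (the rates 1, 2, 4 are a common choice, not stated in the abstract):

```python
def stacked_receptive_field(kernel=3, dilations=(1, 2, 4)):
    """Receptive field of stride-1 dilated convolutions applied in sequence:
    each layer adds (kernel - 1) * dilation input positions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf
```

Three plain 3x3 convolutions give a receptive field of 7; with dilations 1, 2, 4 the same three layers cover 15 positions, at no extra parameter cost.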
SFFEF-YOLO: Small object detection network based on fine-grained feature extraction and fusion for unmanned aerial images
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-26 DOI: 10.1016/j.imavis.2025.105469
Chenxi Bai, Kexin Zhang, Haozhe Jin, Peng Qian, Rui Zhai, Ke Lu
Abstract: Object detection in unmanned aerial vehicle (UAV) images has emerged as a research hotspot, yet remains a significant challenge due to variable target scales and the high proportion of small objects caused by UAVs' diverse altitudes and viewing angles. To address these issues, we propose SFFEF-YOLO, a novel small object detection network based on fine-grained feature extraction and fusion. First, we replace the large prediction head with a tiny prediction head, enhancing detection accuracy for tiny objects while reducing model complexity. Second, we design a Fine-Grained Information Extraction Module (FIEM) to replace standard convolutions; it improves feature extraction and reduces information loss during downsampling by using multi-branch operations and SPD-Conv. Third, we develop a Multi-Scale Feature Fusion Module (MFFM), which adds a skip-connection branch to the bidirectional feature pyramid network (BiFPN) to preserve fine-grained information and improve multi-scale feature fusion. We evaluated SFFEF-YOLO on the VisDrone2019-DET and UAVDT datasets. Compared to YOLOv8, SFFEF-YOLO improves mAP0.5 by 9.9% on VisDrone2019-DET and by 3.6% on UAVDT.
Volume 156, Article 105469.
Citations: 0
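SPD-Conv, which FIEM uses to avoid information loss during downsampling, starts from a space-to-depth rearrangement: every 2x2 spatial patch is moved into the channel dimension, so resolution halves without discarding any pixels (a strided convolution, by contrast, skips them). A NumPy sketch of that rearrangement (the non-strided convolution that normally follows it is omitted):

```python
import numpy as np

def space_to_depth(x, block=2):
    """(C, H, W) -> (C*block*block, H/block, W/block), keeping every pixel."""
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)   # gather the within-block offsets first
    return x.reshape(c * block * block, h // block, w // block)
```

Each output channel group holds one of the four pixel offsets of the original 2x2 patches, so downsampling becomes lossless and reversible.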
AF2CN: Towards effective demoiréing from multi-resolution images
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-24 DOI: 10.1016/j.imavis.2025.105467
Shitan Asu, Yujin Dai, Shijie Li, Zheng Li
Abstract: Recently, CNN-based methods have gained significant attention for the demoiréing task due to their powerful feature extraction capabilities. However, these methods are generally trained on datasets with fixed resolutions, limiting their applicability to diverse real-world scenarios. To address this limitation, we introduce a more general task: effective demoiréing across multiple resolutions. To facilitate this task, we constructed MTADM, the first multi-resolution moiré dataset, designed to capture diverse real-world scenarios. Leveraging this dataset, we conducted extensive studies and introduce the Adaptive Fractional Calculus and Adjacency Fusion Convolution Network (AF2CN). Specifically, we employ fractional derivatives to build an adaptive frequency enhancement module that refines the spatial distribution and texture details of moiré patterns, and we design a spatial attention gate to enhance deep feature interaction. Extensive experiments demonstrate that AF2CN handles multi-resolution moiré patterns effectively and significantly outperforms previous state-of-the-art methods on fixed-resolution benchmarks, while requiring fewer parameters and lower computational costs.
Volume 156, Article 105467.
Citations: 0
NPVForensics: Learning VA correlations in non-critical phoneme–viseme regions for deepfake detection
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing Pub Date : 2025-02-23 DOI: 10.1016/j.imavis.2025.105461
Yu Chen, Yang Yu, Rongrong Ni, Haoliang Li, Wei Wang, Yao Zhao
Abstract: Advanced deepfake technology enables the manipulation of visual and audio signals within videos, leading to visual–audio (VA) inconsistencies. Current multimodal detectors rely primarily on VA contrastive learning to identify such inconsistencies, particularly in critical phoneme–viseme (PV) regions. However, state-of-the-art deepfake techniques now align critical PV pairs, reducing the inconsistency traces on which existing methods rely. Due to technical constraints, forgers cannot fully synchronize VA in non-critical phoneme–viseme (NPV) regions, so we exploit inconsistencies in NPV regions as a general cue for deepfake detection. We propose NPVForensics, a two-stage VA correlation learning framework specifically designed to detect VA inconsistencies in the NPV regions of deepfake videos. First, to better extract VA unimodal features, we use the Swin Transformer to capture long-term global dependencies, and a Local Feature Aggregation (LFA) module aggregates local features along the spatial and channel dimensions, preserving more comprehensive and subtle information. Second, the VA Correlation Learning (VACL) module enhances intra-modal augmentation and inter-modal information interaction, exploring intrinsic correlations between the two modalities, and Representation Alignment is introduced for real videos to narrow the modal gap. Finally, our model is pre-trained on real videos with a self-supervised strategy and fine-tuned for the deepfake detection task. Extensive experiments on six widely used deepfake datasets (FaceForensics++, FakeAVCeleb, Celeb-DF-v2, DFDC, FaceShifter, and DeeperForensics-1.0) show that our method achieves state-of-the-art cross-manipulation generalization and robustness. Notably, it performs especially well on VA-coordinated datasets such as A2V, T2V-L, and T2V-S, indicating that VA inconsistencies in NPV regions serve as a general cue for deepfake detection.
Volume 156, Article 105461.
Citations: 0
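The VA contrastive learning the abstract builds on is typically a symmetric InfoNCE objective: matched visual/audio clips (the same index in a batch) are positives and every other pairing is a negative, so aligned streams yield a low loss and manipulated ones a high loss. A NumPy sketch of that generic objective (an illustration of the family of losses, not the paper's exact VACL module):

```python
import numpy as np

def va_contrastive_loss(vis, aud, temp=0.07):
    """Symmetric InfoNCE over a batch of visual and audio embeddings."""
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    aud = aud / np.linalg.norm(aud, axis=1, keepdims=True)
    sim = vis @ aud.T / temp                     # (B, B) cosine similarities

    def xent(logits):  # cross-entropy with the diagonal as the true class
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent(sim) + xent(sim.T))       # both retrieval directions
```

At inference, a high loss (poor VA correlation) on a video's NPV regions is the signal that the streams were manipulated independently.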