Image and Vision Computing: Latest Articles

Phase shift guided dynamic view synthesis from monocular video
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-18, DOI: 10.1016/j.imavis.2025.105702
Chuyue Zhao, Xin Huang, Xue Wang, Guoqing Zhou, Qing Wang
{"title":"Phase shift guided dynamic view synthesis from monocular video","authors":"Chuyue Zhao,&nbsp;Xin Huang,&nbsp;Xue Wang,&nbsp;Guoqing Zhou,&nbsp;Qing Wang","doi":"10.1016/j.imavis.2025.105702","DOIUrl":"10.1016/j.imavis.2025.105702","url":null,"abstract":"<div><div>This paper endeavors to address the challenge of synthesizing novel views from monocular videos featuring moving objects, particularly in complex scenes with non-rigid deformations. Existing implicit representations rely on motion estimation in the spatial domain, which often struggle to capture correct temporal dynamics under such conditions. To mitigate the drawback, we propose dynamic positional encoding to represent temporal dynamics as learnable phase shifts and leverage the implicit neural representation (INR) network for iterative optimization. Utilizing optimized phase shifts as guidance enhances the representational capability of the dynamic radiance field, thereby alleviating motion ambiguity and reducing artifacts around moving objects in novel views. This paper also introduces a rational evaluation metric, referred to as “dynamic only+”, for the quantitative assessment of the rendering quality in novel views, focusing on dynamic objects and surrounding regions impacted by motion. Experimental results on multiple challenging datasets demonstrate the favorable performance of the proposed approach over state-of-the-art dynamic view synthesis methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105702"},"PeriodicalIF":4.2,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144865776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
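
To make the phase-shift idea concrete: a standard sinusoidal positional encoding can be given one learnable phase offset per frequency band, which is the rough shape of the "dynamic positional encoding" the abstract describes. The sketch below is a minimal, hypothetical PyTorch module; the class and parameter names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PhaseShiftedEncoding(nn.Module):
    """Sinusoidal positional encoding whose phase offsets are learnable.

    NeRF-style encodings map t to [sin(2^k * pi * t), cos(2^k * pi * t)];
    here each frequency band additionally receives a learnable phase shift,
    loosely mirroring the "learnable phase shifts" described in the abstract.
    """

    def __init__(self, num_freqs: int = 6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)
        # One learnable phase shift per frequency band (hypothetical parameterization).
        self.phase = nn.Parameter(torch.zeros(num_freqs))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (N, 1) normalized time stamps in [0, 1]
        arg = t * self.freqs + self.phase          # (N, num_freqs)
        return torch.cat([torch.sin(arg), torch.cos(arg)], dim=-1)

# Usage: encode per-frame time stamps before feeding them to an INR/MLP.
enc = PhaseShiftedEncoding(num_freqs=6)
t = torch.linspace(0, 1, 5).unsqueeze(-1)          # 5 frames
print(enc(t).shape)                                # torch.Size([5, 12])
```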

Codebook prior-guided hybrid attention dehazing network
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-16, DOI: 10.1016/j.imavis.2025.105700
Liqin Huang, Hanyu Zheng, Lin Pan, Zhipeng Su, Qiang Wu
{"title":"Codebook prior-guided hybrid attention dehazing network","authors":"Liqin Huang ,&nbsp;Hanyu Zheng ,&nbsp;Lin Pan ,&nbsp;Zhipeng Su ,&nbsp;Qiang Wu","doi":"10.1016/j.imavis.2025.105700","DOIUrl":"10.1016/j.imavis.2025.105700","url":null,"abstract":"<div><div>Transformers have been widely used in image dehazing tasks due to their powerful self-attention mechanism for capturing long-range dependencies. However, directly applying Transformers often leads to coarse details during image reconstruction, especially in complex real-world hazy scenarios. To address this problem, we propose a novel Hybrid Attention Encoder (HAE). Specifically, a channel-attention-based convolution block is integrated into the Swin-Transformer architecture. This design enhances the local features at each position through an overlapping block-wise spatial attention mechanism while leveraging the advantages of channel attention in global information processing to strengthen the network’s representation capability. Moreover, to adapt to various complex hazy environments, a high-quality codebook prior encapsulating the color and texture knowledge of high-resolution clear scenes is introduced. We also propose a more flexible Binary Matching Mechanism (BMM) to better align the codebook prior with the network, further unlocking the potential of the model. Extensive experiments demonstrate that our method consistently outperforms the second-best methods by a margin of 8% to 19% across multiple metrics on the RTTS and URHI datasets. The source code has been released at <span><span>https://github.com/HanyuZheng25/HADehzeNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105700"},"PeriodicalIF":4.2,"publicationDate":"2025-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144865872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
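
A codebook prior of the kind described is usually realized as a vector-quantization lookup: encoder features are matched against a learned dictionary of clear-scene entries and replaced by their nearest codes. The sketch below shows only that generic nearest-code matching step under assumed tensor shapes; it is not the paper's Binary Matching Mechanism.

```python
import torch

def nearest_code_lookup(features: torch.Tensor, codebook: torch.Tensor):
    """Match each feature vector to its nearest codebook entry (L2 distance).

    features: (N, D) flattened encoder features
    codebook: (K, D) learned entries encoding clear-scene color/texture statistics
    Returns the quantized features and the selected indices.
    """
    dists = torch.cdist(features, codebook, p=2)  # pairwise distances, (N, K)
    indices = dists.argmin(dim=1)                 # nearest entry per feature
    quantized = codebook[indices]                 # (N, D) codebook-guided features
    return quantized, indices

features = torch.randn(16, 64)
codebook = torch.randn(512, 64)
q, idx = nearest_code_lookup(features, codebook)
print(q.shape, idx.shape)                         # torch.Size([16, 64]) torch.Size([16])
```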

TSGaussian: Semantic and depth-guided Target-Specific Gaussian Splatting from sparse views
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-16, DOI: 10.1016/j.imavis.2025.105706
Liang Zhao, Zehan Bao, Yi Xie, Hong Chen, Yaohui Chen, Weifu Li
{"title":"TSGaussian: Semantic and depth-guided Target-Specific Gaussian Splatting from sparse views","authors":"Liang Zhao ,&nbsp;Zehan Bao ,&nbsp;Yi Xie ,&nbsp;Hong Chen ,&nbsp;Yaohui Chen ,&nbsp;Weifu Li","doi":"10.1016/j.imavis.2025.105706","DOIUrl":"10.1016/j.imavis.2025.105706","url":null,"abstract":"<div><div>Recent advances in Gaussian Splatting have significantly advanced the field, achieving both panoptic and interactive segmentation of 3D scenes. However, existing methodologies often overlook the critical need for reconstructing specified targets with complex structures from sparse views. To address this issue, we introduce TSGaussian, a framework that combines semantic constraints with depth priors to avoid geometry degradation in challenging novel view synthesis tasks. Our approach prioritizes computational resources on designated targets while minimizing background allocation. Bounding boxes from YOLOv9 serve as prompts for Segment Anything Model to generate 2D mask predictions, ensuring semantic accuracy and cost efficiency. TSGaussian effectively clusters 3D gaussians by introducing a compact identity encoding for each Gaussian ellipsoid and incorporating 3D spatial consistency regularization. Leveraging these modules, we propose a pruning strategy to effectively reduce redundancy in 3D gaussians. Extensive experiments demonstrate that TSGaussian outperforms state-of-the-art methods on three standard datasets and a self-built dataset, achieving superior results in novel view synthesis of specific objects. Code is available at: <span><span>https://github.com/leon2000-ai/TSGaussian</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105706"},"PeriodicalIF":4.2,"publicationDate":"2025-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144892838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
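
The detector-to-segmenter prompting described above (detector boxes used as prompts for a promptable mask model) reduces to a small pipeline. In the sketch below, `detect_boxes` and `segment_with_box` are hypothetical stand-ins for the YOLOv9 and Segment Anything calls, so the flow runs without either library and does not reproduce the released TSGaussian code.

```python
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def boxes_to_masks(
    image: np.ndarray,
    detect_boxes: Callable[[np.ndarray], List[Box]],
    segment_with_box: Callable[[np.ndarray, Box], np.ndarray],
    min_box_area: float = 100.0,
) -> List[np.ndarray]:
    """Run a detector, then prompt a promptable segmenter with each box.

    detect_boxes: returns target bounding boxes (e.g. from a YOLO-style detector)
    segment_with_box: returns a binary mask for one box prompt (e.g. SAM-style)
    """
    masks = []
    for (x0, y0, x1, y1) in detect_boxes(image):
        if (x1 - x0) * (y1 - y0) < min_box_area:
            continue                      # skip spurious tiny detections
        mask = segment_with_box(image, (x0, y0, x1, y1))
        masks.append(mask.astype(bool))
    return masks

# Toy usage with stand-in callables so the sketch runs end to end.
image = np.zeros((128, 128, 3), dtype=np.uint8)
fake_detector = lambda img: [(20.0, 20.0, 80.0, 90.0)]
fake_segmenter = lambda img, box: np.ones(img.shape[:2], dtype=np.uint8)
print(len(boxes_to_masks(image, fake_detector, fake_segmenter)))  # 1
```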

Structure-aware contrastive learning for glomerulus segmentation in renal pathology
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-16, DOI: 10.1016/j.imavis.2025.105698
Yuanqing Wang, Tao Wang, Xiangbo Shu, Yuhui Zheng, Jin Ding, Xianghui Fu, Zhaohui Zheng
{"title":"Structure-aware contrastive learning for glomerulus segmentation in renal pathology","authors":"Yuanqing Wang ,&nbsp;Tao Wang ,&nbsp;Xiangbo Shu ,&nbsp;Yuhui Zheng ,&nbsp;Jin Ding ,&nbsp;Xianghui Fu ,&nbsp;Zhaohui Zheng","doi":"10.1016/j.imavis.2025.105698","DOIUrl":"10.1016/j.imavis.2025.105698","url":null,"abstract":"<div><div>Accurate segmentation of glomeruli in renal pathology is challenging due to the difficulty in distinguishing glomeruli from surrounding tissues and their indistinct boundaries. Traditional methods often struggle with local receptive fields, primarily capturing texture rather than the overall shape of these structures. To address this issue, this paper presents a structure-aware contrastive learning strategy for precise glomerular segmentation. We implement a superpixel consistency constraint, dividing pathological images into regions of local consistency to ensure that pixels within the same area maintain feature similarity, thereby capturing structural cues of various renal tissues. The introduced loss function applies shape constraints, enabling the model to better represent the complex morphology of glomeruli against challenging backgrounds. To enhance shape consistency within glomeruli while ensuring discriminability from external tissues, we develop a contrastive learning approach that utilizes extracted structural cues. This encourages the network to effectively learn internal shape constraints and differentiate between distinct regions in feature space. Finally, we implement a multi-scale convolutional attention mechanism that integrates spatial and channel attention, improving the capture of structural features across scales. Experimental results demonstrate that our method significantly enhances segmentation accuracy across multiple public datasets, showcasing the potential of contrastive learning in renal pathology.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105698"},"PeriodicalIF":4.2,"publicationDate":"2025-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144906953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
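
One plausible reading of the superpixel consistency constraint is an InfoNCE-style loss that pulls each pixel toward its own superpixel prototype and away from other regions. The sketch below illustrates that reading under assumed shapes; the paper's actual loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def superpixel_consistency_loss(feats: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Contrast pixel features against superpixel prototypes (InfoNCE-style).

    feats:  (N, D) pixel features (e.g. flattened decoder features)
    labels: (N,)   superpixel id of each pixel, values in [0, S)
    """
    feats = F.normalize(feats, dim=1)
    num_regions = int(labels.max().item()) + 1
    # Region prototypes: mean feature per superpixel.
    protos = torch.zeros(num_regions, feats.shape[1], device=feats.device)
    protos.index_add_(0, labels, feats)
    counts = torch.bincount(labels, minlength=num_regions).clamp(min=1).unsqueeze(1)
    protos = F.normalize(protos / counts, dim=1)
    # Similarity of every pixel to every prototype; its own region is the positive.
    logits = feats @ protos.t() / tau             # (N, S)
    return F.cross_entropy(logits, labels)

feats = torch.randn(1000, 32)
labels = torch.randint(0, 50, (1000,))
print(superpixel_consistency_loss(feats, labels).item())
```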

ECNet: An edge-guided and cross-image perception network for collaborative camouflaged object detection
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-14, DOI: 10.1016/j.imavis.2025.105697
Shiyuan Li, Hongbo Bi, Disen Mo, Cong Zhang, Yue Li
{"title":"ECNet: An edge-guided and cross-image perception network for collaborative camouflaged object detection","authors":"Shiyuan Li ,&nbsp;Hongbo Bi ,&nbsp;Disen Mo ,&nbsp;Cong Zhang ,&nbsp;Yue Li","doi":"10.1016/j.imavis.2025.105697","DOIUrl":"10.1016/j.imavis.2025.105697","url":null,"abstract":"<div><div>Traditional camouflaged object detection (COD) methods typically focus on individual images, ignoring the contextual information from multiple related images. However, objects are often captured in multiple images or from different viewpoints in real scenarios. Leveraging collaborative information from multiple images can achieve more robust and accurate detection. This collaborative approach, known as “Collaborative Camouflaged Object Detection (CoCOD)”, addresses the limitations of single-image methods by exploiting complementary information from multiple images, enhancing detection performance. Recent advancements in CoCOD have shown notable progress. However, challenges remain in effectively extracting multi-scale features and facilitating cross-attention feature interactions. To address these limitations, we propose a novel framework, named the Edge-Guided and Cross-Image Perception Network (ECNet). The ECNet consists of two core components: the edge-guided scale module (EGSM) and the cross-image perception enhancement module (CPEM). Specifically, EGSM enhances feature extraction by integrating edge-aware guidance with multi-scale asymmetric convolutions. Meanwhile, CPEM strengthens cross-image feature interaction by introducing collaborative attention, which reinforces semantic consistency among correlated targets and suppresses distracting background information. By integrating edge-aware features across multiple spatial scales and cross-image semantic consistency, ECNet effectively addresses the challenges of camouflage detection in visually complex scenarios. Extensive experiments on the CoCOD8K dataset demonstrate that our proposed ECNet outperforms 18 state-of-the-art COD methods, 11 co-salient object detection (CoSOD) models, and 4 CoCOD approaches, as evaluated by six widely used metrics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105697"},"PeriodicalIF":4.2,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144865873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
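
The multi-scale asymmetric convolutions mentioned for the EGSM can be pictured as parallel 1xk / kx1 branches at several kernel sizes whose outputs are fused. The module below is a generic illustration of that pattern, not the released ECNet block.

```python
import torch
import torch.nn as nn

class AsymmetricMultiScaleConv(nn.Module):
    """Parallel asymmetric (1xk then kx1) convolutions at several kernel sizes.

    Decomposing a kxk kernel into 1xk + kx1 keeps elongated, edge-like
    receptive fields cheap, which is why such branches pair naturally
    with edge guidance.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2)),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0)),
            ))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = sum(branch(x) for branch in self.branches)  # merge multi-scale branches
        return self.fuse(out) + x                         # residual connection

x = torch.randn(1, 32, 64, 64)
print(AsymmetricMultiScaleConv(32)(x).shape)              # torch.Size([1, 32, 64, 64])
```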

A non-local adaptive hypothesis propagation for multi-view stereo
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-11, DOI: 10.1016/j.imavis.2025.105704
Yufeng Yin, Xiaoyan Liu, Qing Fan, Zichao Zhang
{"title":"A non-local adaptive hypothesis propagation for multi-view stereo","authors":"Yufeng Yin ,&nbsp;Xiaoyan Liu ,&nbsp;Qing Fan ,&nbsp;Zichao Zhang","doi":"10.1016/j.imavis.2025.105704","DOIUrl":"10.1016/j.imavis.2025.105704","url":null,"abstract":"<div><div>Hypothesis propagation is a central component of PatchMatch-based multi-view stereo and significantly impacts the reconstruction performance. However, current propagation methods rely on photometric consistency to guide hypothesis propagation within a local area. When the centroid is located in a low-textured area with reflective or refractive properties, high chromatic aberration may cause the multi-view matching to fall into a local optimum that fails to provide reliable hypotheses, leading to reconstruction errors. To address this problem, we propose a non-local adaptive hypothesis propagation scheme. First, we evenly distribute sampling points in eight directions on the checkerboard to quickly determine reliable initial hypotheses. Then, starting from the initial hypotheses generated in the eight directions of the checkerboard, the hypotheses are adaptively propagated to non-checkerboard areas based on matching cost, reducing interference from unreliable photometric consistency and improving reconstruction performance in challenging areas. The test results on large-scale benchmarks show that the proposed scheme has significant advantages in reconstructing challenging areas. It can significantly improve the completeness of point clouds from current state-of-the-art methods and outperform existing propagation schemes.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105704"},"PeriodicalIF":4.2,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
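
The checkerboard sampling in eight directions amounts to comparing the matching cost of hypotheses sampled along eight offsets and adopting the cheapest one. The sketch below shows that selection step for a single pixel; the cost function and propagation schedule are placeholders rather than the paper's full scheme.

```python
import numpy as np

# Eight propagation directions (dy, dx) around a pixel on the checkerboard.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

def propagate_best_hypothesis(depth, cost, y, x, step=2):
    """Replace the hypothesis at (y, x) if a sampled neighbor has lower matching cost.

    depth, cost: (H, W) current per-pixel depth hypotheses and their matching costs
    step:        sampling distance along each direction
    """
    h, w = depth.shape
    best_d, best_c = depth[y, x], cost[y, x]
    for dy, dx in DIRECTIONS:
        ny, nx = y + dy * step, x + dx * step
        if 0 <= ny < h and 0 <= nx < w and cost[ny, nx] < best_c:
            best_d, best_c = depth[ny, nx], cost[ny, nx]   # adopt the cheaper hypothesis
    return best_d, best_c

depth = np.random.rand(32, 32)
cost = np.random.rand(32, 32)
print(propagate_best_hypothesis(depth, cost, 16, 16))
```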

Distributed collaborative machine learning in real-world application scenario: A white blood cell subtypes classification case study
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-11, DOI: 10.1016/j.imavis.2025.105673
Lorenzo Putzu, Simone Porcu, Andrea Loddo
{"title":"Distributed collaborative machine learning in real-world application scenario: A white blood cell subtypes classification case study","authors":"Lorenzo Putzu ,&nbsp;Simone Porcu ,&nbsp;Andrea Loddo","doi":"10.1016/j.imavis.2025.105673","DOIUrl":"10.1016/j.imavis.2025.105673","url":null,"abstract":"<div><div>White blood cell (WBC) subtype classification is a critical step in monitoring an individual’s health. However, it remains a challenging task due to the significant morphological variability of WBCs and the domain shift introduced by differing acquisition protocols across hospitals. Numerous approaches have been proposed to mitigate domain shift, including supervised and unsupervised domain adaptation, as well as domain generalisation. These methods, however, require a suitable amount of representative target images, even if unlabelled, or a suitable amount of images from multiple sources, which may not be feasible due to privacy regulations. In this study, we explore an alternative paradigm, known as <em>Distributed Collaborative Machine Learning</em> (DCML), which consists of exploiting images from different sources in a privacy-preserving setup. Although DCML methods seem well suited to this application, to the best of our knowledge, they have not been used for this task or to address the above-mentioned issues. However, we argue that DCML deserves further consideration in medical images as a potential alternative solution against domain shift in a privacy-preserving setup. To substantiate our view, we consider three DCML methods: early and late fusion and federated learning approaches, each offering distinct trade-offs in terms of training constraints, computational overhead and communications costs. We then conduct an extensive, cross-dataset experimental evaluation on four benchmark datasets and provide evidence that even <em>simple</em> implementations of DCML methods can effectively mitigate domain shift in WBC classification tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105673"},"PeriodicalIF":4.2,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
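
Of the three DCML strategies compared, federated learning is the most self-contained to sketch: each hospital trains locally and only model weights are aggregated, so images never leave their source. Below is plain federated averaging (FedAvg) of PyTorch state dicts with sample-count weighting, offered as a generic illustration rather than the paper's experimental setup.

```python
import copy
from typing import Dict, List

import torch

def federated_average(state_dicts: List[Dict[str, torch.Tensor]],
                      num_samples: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of client model weights (FedAvg).

    state_dicts: one state_dict per client, all with identical keys/shapes
    num_samples: number of local training images per client (weighting factor)
    """
    total = float(sum(num_samples))
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(state_dicts, num_samples))
    return avg

# Toy round: three "hospitals" share weights of the same tiny classifier.
clients = [torch.nn.Linear(8, 5) for _ in range(3)]
global_weights = federated_average([c.state_dict() for c in clients], [120, 300, 80])
for c in clients:
    c.load_state_dict(global_weights)   # broadcast the aggregated model back
```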

Spatiotemporal XAI: Explaining video regression models in echocardiography videos for ejection fraction prediction
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-08, DOI: 10.1016/j.imavis.2025.105691
Yakup Abrek Er, Arda Guler, Mehmet Cagri Demir, Hande Uysal, Gamze Babur Guler, Ilkay Oksuz
{"title":"Spatiotemporal XAI: Explaining video regression models in echocardiography videos for ejection fraction prediction","authors":"Yakup Abrek Er ,&nbsp;Arda Guler ,&nbsp;Mehmet Cagri Demir ,&nbsp;Hande Uysal ,&nbsp;Gamze Babur Guler ,&nbsp;Ilkay Oksuz","doi":"10.1016/j.imavis.2025.105691","DOIUrl":"10.1016/j.imavis.2025.105691","url":null,"abstract":"<div><div>Deep learning has showcased unprecedented success in automating echocardiography analysis. However, most of the deep learning algorithms are hindered at clinical translation due to their black-box nature. This paper aims to tackle this issue by quantitatively evaluating video regression models’ focus on the left ventricle (LV) for ejection fraction (EF) prediction task spatiotemporally in apical 4 chamber (A4C) echocardiograms using a gradient-based saliency method. We performed a quantitative evaluation to assess the ratio of how many of the maximum absolute gradient values of the deep learning models fall on the left ventricle for the video regression task of ejection fraction prediction. Then, we extend the experiment and pick the most important gradients as the segmentation size and check the ratio of intersection. Finally, we picked temporally aligned sub-clips from end diastole to end systole and calculated the expected accuracies of the mentioned metrics in time. All tests are performed in 3 different models with different architectures and results are examined quantitatively. The filtered test set includes 1209 A4C echo videos of with mean EF of 55.5%. Trained models showed 0.73 to 0.83 Pointing Game scores, where it was 0.09 for the baseline random model. <span><math><msub><mrow><mi>m</mi></mrow><mrow><mi>G</mi><mi>T</mi></mrow></msub></math></span> intersection score was 0.46 to 0.50 for the trained models, whereas the random model’s score was 0.18. Models have higher pointing game scores on the end diastole and end systole compared to intermediate frames. Transformer based models’ <span><math><msub><mrow><mi>m</mi></mrow><mrow><mi>G</mi><mi>T</mi></mrow></msub></math></span> intersection scores were negatively correlated with their error rate. All models located the left ventricle successfully and their localization performance was generally better in semantically important frames rather than the larger target area. This observation from the spatiotemporal analysis suggests possible clinical relevance to model reasoning.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105691"},"PeriodicalIF":4.2,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144840768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
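
Both evaluation ideas are simple to compute: the Pointing Game checks whether the single largest absolute gradient falls inside the left-ventricle mask, and the m_GT intersection takes the top-k salient pixels (k equal to the mask size) and measures the fraction inside the mask. The sketch below follows that description; the array names are assumptions.

```python
import numpy as np

def pointing_game_hit(saliency: np.ndarray, lv_mask: np.ndarray) -> bool:
    """True if the maximum |gradient| location falls on the left-ventricle mask."""
    y, x = np.unravel_index(np.abs(saliency).argmax(), saliency.shape)
    return bool(lv_mask[y, x])

def topk_mask_intersection(saliency: np.ndarray, lv_mask: np.ndarray) -> float:
    """Fraction of the top-k salient pixels inside the mask, with k = mask size."""
    k = int(lv_mask.sum())
    flat = np.abs(saliency).ravel()
    topk = np.argpartition(flat, -k)[-k:]         # indices of the k largest saliencies
    return float(lv_mask.ravel()[topk].sum()) / k

saliency = np.random.rand(112, 112)               # stand-in for a per-frame gradient map
lv_mask = np.zeros((112, 112), dtype=bool)
lv_mask[30:70, 40:80] = True                      # stand-in LV segmentation
print(pointing_game_hit(saliency, lv_mask), topk_mask_intersection(saliency, lv_mask))
```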

InpaintingPose: Enhancing human pose transfer by image inpainting
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-08, DOI: 10.1016/j.imavis.2025.105690
Wei Zhang, Chenglin Zhou, Xuekang Peng, Zhichao Lian
{"title":"InpaintingPose: Enhancing human pose transfer by image inpainting","authors":"Wei Zhang,&nbsp;Chenglin Zhou,&nbsp;Xuekang Peng,&nbsp;Zhichao Lian","doi":"10.1016/j.imavis.2025.105690","DOIUrl":"10.1016/j.imavis.2025.105690","url":null,"abstract":"<div><div>Human pose transfer involves transforming a human subject in a reference image from a source pose to a target pose while maintaining consistency in both appearance and background. Most existing methods treat the appearance and background in the reference image as a unified entity, which causes the background to be disrupted by pose transformations and prevents the model from focusing on the complex relationship between appearance and pose. In this paper, we propose InpaintingPose, a novel human pose transfer framework based on image inpainting, which enables precise pose control without affecting the background. InpaintingPose separates the background from the appearance, applying transformations only where necessary. This strategy prevents the background from being affected by pose transformations and allows the model to focus on the coupling between appearance and pose. Additionally, we introduce an appearance control mechanism to ensure appearance consistency between the generated images and the reference images. Finally, we propose an initial noise optimization strategy to address the instability in generating human images with extremely bright backgrounds. By decoupling appearance and background, InpaintingPose can also recombine the appearance and background from different reference images to produce realistic human images. Extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art FID scores of 4.74 and 26.74 on DeepFashionv2 and TikTok datasets, respectively, significantly outperforming existing approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105690"},"PeriodicalIF":4.2,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144858353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
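
The appearance/background decoupling described here ultimately relies on masked compositing: only the person region is re-generated in the target pose and then blended back over the untouched background. A minimal, framework-free sketch of that blending step follows; the mask source and the generator producing the person region are placeholders, not the paper's pipeline.

```python
import numpy as np

def composite_pose_transfer(reference: np.ndarray,
                            generated_person: np.ndarray,
                            person_mask: np.ndarray) -> np.ndarray:
    """Blend a re-generated person region over the untouched reference background.

    reference:        (H, W, 3) original image, supplies the background
    generated_person: (H, W, 3) person rendered in the target pose (inpainted region)
    person_mask:      (H, W)    1 where the target-pose person should appear
    """
    mask = person_mask[..., None].astype(np.float32)
    return (mask * generated_person + (1.0 - mask) * reference).astype(reference.dtype)

reference = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
generated = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
mask = np.zeros((256, 192), dtype=np.uint8)
mask[40:220, 50:140] = 1
print(composite_pose_transfer(reference, generated, mask).shape)  # (256, 192, 3)
```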

Diverse Information Aggregation with Adaptive Graph Construction and prompts for deepfake detection
IF 4.2, CAS Tier 3, Computer Science
Image and Vision Computing, Pub Date: 2025-08-08, DOI: 10.1016/j.imavis.2025.105682
Zhenhua Bai, Qiangchang Wang, Lu Yang, Xinxin Zhang, Yanbo Gao, Yilong Yin
{"title":"Diverse Information Aggregation with Adaptive Graph Construction and prompts for deepfake detection","authors":"Zhenhua Bai ,&nbsp;Qiangchang Wang ,&nbsp;Lu Yang ,&nbsp;Xinxin Zhang ,&nbsp;Yanbo Gao ,&nbsp;Yilong Yin","doi":"10.1016/j.imavis.2025.105682","DOIUrl":"10.1016/j.imavis.2025.105682","url":null,"abstract":"<div><div>Due to the misuse of face manipulation techniques, there has been increasing attention on deepfake detection. Recently, some methods have employed ViTs to capture the inconsistency in forged faces, providing a global perspective for exploring diverse and generalized patterns to avoid overfitting. These methods typically divided an image into fixed-shape patches. However, each patch contains only a tiny fraction of facial regions, thereby inherently lacking explicit semantic and structural relations with other patches, which is insufficient to model the global context information effectively. To enhance the global context interaction, a Diverse INformation Aggregation (DINA) framework is proposed for deepfake detection, which consists of two information aggregation modules: Adaptive Graph Convolution Network (AGCN) and Multi-Scale Prompt Fusion (MSPF). Specifically, the AGCN utilizes a novel strategy to construct neighbors of each token based on spatial and feature relations. Then, a graph convolution network is applied to aggregate information from different tokens to form a token with rich semantics and local information, termed the group token. These group tokens can be used to form robust representations of global information. Moreover, the MSPF utilizes prompts to incorporate unique forgery traces from complementary information, i.e., multi-scale and frequency information, into group tokens in a fine-grained and adaptive manner, which provides extra information to further improve the robustness of group tokens. Consequently, our model can learn robust global context-aware representations, capturing more generalized forgery patterns from global information. The proposed framework outperforms the state-of-the-art competitors on several benchmarks, showing the generalization ability of our method.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105682"},"PeriodicalIF":4.2,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144809470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
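
The adaptive graph construction attributed to the AGCN (neighbors chosen by both spatial proximity and feature similarity, then aggregated into group tokens) can be illustrated with a small k-nearest-neighbor sketch. The distance blending and mean aggregation below are assumptions made for illustration, not the paper's exact operators.

```python
import torch

def build_adaptive_neighbors(tokens: torch.Tensor, coords: torch.Tensor,
                             k: int = 8, alpha: float = 0.5) -> torch.Tensor:
    """Pick k neighbors per token using a blend of feature and spatial distance.

    tokens: (N, D) patch-token features
    coords: (N, 2) patch-center coordinates (normalized)
    alpha:  weight between feature distance (alpha) and spatial distance (1 - alpha)
    Returns neighbor indices of shape (N, k).
    """
    feat_d = torch.cdist(tokens, tokens)           # (N, N) feature distances
    spat_d = torch.cdist(coords, coords)           # (N, N) spatial distances
    dist = alpha * feat_d / feat_d.max() + (1 - alpha) * spat_d / spat_d.max()
    dist.fill_diagonal_(float("inf"))              # exclude self from neighbors
    return dist.topk(k, largest=False).indices     # (N, k)

def aggregate_group_tokens(tokens: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """Mean aggregation over each token's neighborhood (a one-hop graph-conv step)."""
    return torch.cat([tokens.unsqueeze(1), tokens[neighbors]], dim=1).mean(dim=1)

tokens = torch.randn(196, 128)                     # e.g. 14x14 ViT patch tokens
ys, xs = torch.meshgrid(torch.arange(14), torch.arange(14), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float() / 13.0
group = aggregate_group_tokens(tokens, build_adaptive_neighbors(tokens, coords))
print(group.shape)                                 # torch.Size([196, 128])
```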