{"title":"Generalizable deepfake detection via Spatial Kernel Selection and Halo Attention Network","authors":"Siyou Guo , Qilei Li , Mingliang Gao , Xianxun Zhu , Imad Rida","doi":"10.1016/j.imavis.2025.105582","DOIUrl":"10.1016/j.imavis.2025.105582","url":null,"abstract":"<div><div>The rapid advancement of AI-Generated Content (AIGC) has enabled the unprecedented synthesis of photorealistic facial images. While these technologies offer transformative potential for creative industries, they also introduce significant risks due to the malicious manipulation of visual media. Current deepfake detection methods struggle with unseen forgeries due to their inability to consider the effects of spatial receptive fields and local representation learning. To bridge these gaps, this paper proposes a Spatial Kernel Selection and Halo Attention Network (SKSHA-Net) for deepfake detection. The proposed model incorporates two key modules, namely Spatial Kernel Selection (SKS) and Halo Attention (HA). The SKS module dynamically adjusts the spatial receptive field to capture subtle artifacts indicative of forgery. The HA module focuses on the intricate relationships between neighboring pixels for local representation learning. Comparative experiments on three public datasets demonstrate that SKSHA-Net outperforms the state-of-the-art (SOTA) methods in both intra-testing and cross-testing.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105582"},"PeriodicalIF":4.2,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Masked Graph Attention network for classification of facial micro-expression","authors":"Ankith Jain Rakesh Kumar, Bir Bhanu","doi":"10.1016/j.imavis.2025.105584","DOIUrl":"10.1016/j.imavis.2025.105584","url":null,"abstract":"<div><div>Facial micro-expressions (MEs) are ultra-fine, quick, and short-motion muscle movements expressing a person’s true feelings. Automatic recognition of MEs with only a few samples is challenging and the extraction of subtle features becomes crucial. This paper addresses these intricacies and presents a novel dual-branch (branch1 for node locations and branch2 for optical flow patch information) masked graph attention network-based approach (MaskGAT) to classify MEs in a video. It utilizes a three-frame graph structure to extract spatio-temporal information. It learns a mask for each node to eliminate the less important node features and propagates the important node features to the neighboring nodes. A masked self-attention graph pooling layer is designed to provide the attention score to eliminate the unwanted nodes and uses only the nodes with a high attention score. An adaptive frame selection mechanism is designed that is based on a sliding window optical flow method to discard the low-intensity emotion frames. A well-designed dual-branch fusion mechanism is developed to extract informative features for the final classification of MEs. Furthermore, the paper presents a detailed mathematical analysis and visualization of the MaskGAT pipeline to demonstrate the effectiveness of node feature masking and pooling. The results are presented and compared with the state-of-the-art methods for SMIC, SAMM, CASME II, and MMEW databases. Further, cross-dataset experiments are carried out, and the results are reported.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105584"},"PeriodicalIF":4.2,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic three-dimensional reconstruction of transparent objects with multiple optimization strategies under limited constraints","authors":"Xiaopeng Sha , Xiaopeng Si , Yujie Zhu , Shuyu Wang , Yuliang Zhao","doi":"10.1016/j.imavis.2025.105580","DOIUrl":"10.1016/j.imavis.2025.105580","url":null,"abstract":"<div><div>Reconstructing transparent objects with limited constraints has long been considered a highly challenging problem. Due to the complex interaction between transparent objects and light, which involves intricate refraction and reflection relationships, traditional three-dimensional (3D) reconstruction methods are less than effective for transparent objects. To address this issue, this study proposes a 3D reconstruction method specifically designed for transparent objects. Incorporating multiple optimization strategies, the method works under limited constraints to achieve the automatic reconstruction of transparent objects with only a few transparent object images in any known environment, without the need for specific data collection devices or environments. The proposed method makes use of automatic image segmentation and modifies the network interface and structure of the PointNeXt algorithm to introduce the TransNeXt network, which enhances normal features, optimizes weight attenuation, and employs a preheating cosine annealing learning rate. We use several steps to reconstruct the complete 3D shape of transparent objects. First, we initialize the transparent shape with a visual hull reconstructed with the contours obtained by the TOM-Net. Then, we construct the normal reconstruction network to estimate the normal values. Finally, we reconstruct the complete 3D shape using the TransNeXt network. Multiple experiments show that the TransNeXt network exhibits superior reconstruction performance to other networks and can effectively perform the automatic reconstruction of transparent objects even under limited constraints.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105580"},"PeriodicalIF":4.2,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144135055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid approach combining images and questionnaires for early detection and severity assessment of Autism Spectrum Disorder","authors":"Rajkumar S.C. , Stefano Cirillo , Yuvasini D. , Luisa Solimando","doi":"10.1016/j.imavis.2025.105547","DOIUrl":"10.1016/j.imavis.2025.105547","url":null,"abstract":"<div><div>In this research, we propose a novel integrated system for the early diagnosis and cognitive enhancement of infants with Autism Spectrum Disorder (ASD). The system combines two core modules: the Behavioral Analytic Module and the Cognitive Skill Enhancement Module. The Behavioral Analytic Module includes a Questionnaire Analysis Sub-module, which utilizes Random Forest classifiers to analyze questionnaire data, and an Image Analysis Sub-module, which employs a fine-tuned VGG16 Convolutional Neural Network to process facial images. These sub-modules independently assess ASD likelihood and combine their outputs to generate a comprehensive diagnosis using a weighted averaging technique. The Cognitive Skill Enhancement Module integrates interactive games and web-based animations designed to improve cognitive abilities and daily living skills in toddlers with ASD. Additionally, a reward system is incorporated to reinforcement learning outcomes, adaptively calculating rewards based on the infants’ progress. The proposed system aims to provide a holistic approach to ASD diagnosis and intervention, offering an effective tool for early detection and tailored cognitive development. The system’s efficacy is demonstrated through comparative analysis, showing a 93% improvement in diagnostic accuracy and a 92% enhancement in cognitive skill development among toddlers with ASD.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105547"},"PeriodicalIF":4.2,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FaiResGAN: Fair and robust blind face restoration with biometrics preservation","authors":"George Azzopardi , Antonio Greco , Mario Vento","doi":"10.1016/j.imavis.2025.105575","DOIUrl":"10.1016/j.imavis.2025.105575","url":null,"abstract":"<div><div>Modern computer vision technologies enable systems to detect, recognize, and analyze facial features, but challenges arise when images are noisy, blurred, or low quality. Blind face restoration, which aims to recover high-quality facial images without prior knowledge of degradation, addresses this issue. In this paper, we introduce Fair Restoration GAN (FaiResGAN), a novel Generative Adversarial Network (GAN) designed to balance face restoration with the preservation of soft biometrics (identity, ethnicity, age, and gender). Our model incorporates a pseudo-random batch composition algorithm to promote fairness and mitigate bias, alongside a realistic degradation model simulating corruptions typical in surveillance images. Experimental results show that FaiResGAN outperforms state-of-the-art blind face restoration methods, both quantitatively and qualitatively. A user study involving 40 participants showed that FaiResGAN-restored images were preferred by 70% of users. Additionally, tests on VGGFace2, UTKFace, and FairFace datasets demonstrate FaiResGAN’s superior performance in preserving soft biometric attributes and ensuring fair restoration across different genders and ethnicities.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105575"},"PeriodicalIF":4.2,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144135056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAGNet: Synergistic Attention-Graph Network For video salient object detection","authors":"Huo Lina, Xueyuan Gao, Wei Wang, Ke Chen, Ke Wang","doi":"10.1016/j.imavis.2025.105570","DOIUrl":"10.1016/j.imavis.2025.105570","url":null,"abstract":"<div><div>In the field of video salient object detection (VSOD), accurately capturing motion information is essential. Previous approaches primarily rely on optical flow, convolutional long short term memory (ConvLSTM), or 3D convolutional neural network (CNN) to extract and utilize motion information. However, these methods capture limited motion details and increase the parameters in the network. Moreover, Transformer-based methods, while effective in high-level feature modeling, suffer from excessive computational complexity and insufficient local feature extraction, limiting their practical application in VSOD. To address these challenges, we propose a novel synergistic attention-graph network (SAGNet) that independently distills spatial–temporal cues and spatial edge features using the synergistic attention-graph module (SAGM) and the spatial edge attention module (SEM), respectively. SAGM innovatively integrates inter-frame attention with spatial–temporal graph convolution network (GCN). The inter-frame attention proposed in SAGM captures motion information between video frames while expanding the receptive field to capture long-range dependencies. Spatial–temporal GCN models video as a graph, bridge features from temporal into spatial branch, which is capable of fusing cross-modal features collaboratively. This synergy enables SAGNet to consider both global and local spatial–temporal features. SEM enhances high-level information by extracting spatial and edge features from the low-level data using the Sobel operator and spatial attention module. Experimental results on several publicly available VSOD benchmark datasets demonstrate that SAGNet outperforms existing methods in terms of detection accuracy and efficiency, confirming its superiority and practicality in VSOD.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105570"},"PeriodicalIF":4.2,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144115095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LCNet: Lightweight real-time image classification network based on efficient multipath dynamic attention mechanism and dynamic threshold convolution","authors":"Xiaoxia Yang , Zhishuai Zheng , Huanqi Zheng , Zhedong Ge , Xiaotong Liu , Bei Zhang , Jinyang Lv","doi":"10.1016/j.imavis.2025.105576","DOIUrl":"10.1016/j.imavis.2025.105576","url":null,"abstract":"<div><div>Hybrid architectures that integrate convolutional neural networks (CNNs) with Transformers can comprehensively extract both local and global image features, exhibiting impressive performance in image classification. However, their large parameter sizes and high computational demands hinder deployment on low-resource devices. To address this limitation, we propose a dual-branch classification network based on a pyramid architecture, termed LCNet. First, we introduce a dynamic threshold convolution module that adaptively adjusts convolutional parameters based on the input, thereby improving the efficiency of feature extraction. Second, we design a multi-path dynamic attention mechanism that optimizes attention weights to capture salient information and enhance the significance of key features. Third, a star-shaped connection is adopted to enable efficient information fusion between the two branches in a high-dimensional implicit feature space. LCNet is evaluated on four public datasets and one wood dataset (Tiny-ImageNet, Mini-ImageNet, CIFAR100, CIFAR10, and Micro-CT) using recognition accuracy and inference efficiency as metrics. The results show that LCNet achieves a maximum accuracy of 99.50% with an inference time of only 0.0072 s per image, outperforming other state-of-the-art (SOTA) models. Extensive experiments demonstrate that LCNet is more competitive than existing neural networks and can be effectively deployed on low-performance computing devices. This broadens the applicability of image classification techniques, aligns with the trend of edge computing, reduces reliance on cloud servers, and enhances both real-time processing and data privacy.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105576"},"PeriodicalIF":4.2,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning-assisted 3D model for the detection and classification of knee arthritis","authors":"D. Preethi , V. Govindaraj , S. Dhanasekar , K. Martin Sagayam , Syed Immamul Ansarullah , Farhan Amin , Isabel de la Torre D'ıez , Carlos Osorio Garc'ıa , Alina Eugenia Pascual Barrera , Fehaid Salem Alshammari","doi":"10.1016/j.imavis.2025.105574","DOIUrl":"10.1016/j.imavis.2025.105574","url":null,"abstract":"<div><div>Osteoarthritis (OA) affects nearly 240 million people worldwide. It is a common degenerative illness that typically affects the knee joint OA causes pain, and functional disability, especially in older adults is a common disease. One of the most common and challenging medical conditions to deal with in old-aged people is the occurrence of knee osteoarthritis (KOA). Manual diagnosis involves observing X-ray images of the knee area and classifying it into different five grades. This requires the physician's expertise, suitable experience, and a lot of time, and even after that, the diagnosis can be prone to errors. Therefore, researchers in the machine learning (ML) and deep learning (DL) domains have employed the capabilities of deep neural network (DNN) models to identify and classify medical images in an automated, faster, and more accurate manner. Combining multiple imaging modalities or utilizing three-dimensional reconstructions can enhance the accuracy and completeness of 2D Images in diagnostic information. Hence to overcome the drawbacks of 2D imaging, the reconstruction of 3D models using 2D images is the main theme of our research. In this paper, we propose a deep learning-based model for the detection and classification of the early diagnosis of arthritis. It is a four-step procedure starting with data collection followed by data conversion. In this step, our proposed model deforms the target's convex hull to produce a 3D model. Herein, a series of 2D photos is utilized, along with surface rendering methods, to create a 3D model. In the third step, the feature extraction is performed followed by mesh refinement. The chamfer loss is optimized based on the rotational shape of the leg bones, and subsequently, the weight of the loss function can be allocated to the target's geometric properties. We have used a modified Gray Level Co-occurrence Matrix (GLCM) for feature extraction. In the fourth step, the image classification is performed and the suggested optimization strategy raises the model's accuracy. A comparison of results with current 3D reconstruction techniques proves that the suggested method can consistently produce a waterproof model with a greater reconstruction accuracy. The deep-seated intricacies and distinct patterns across arthritic phases are estimated through the extraction of complicated statistical variables combined with power spectral density. The high-dimensional data is divided into separate, easily observable groups using the Lion Optimization Algorithm and proposed distance metric. 
The F1 Score and Jaccard Metric showed an average of 0.85 and 0.23, indicating effective differentiation across clusters.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105574"},"PeriodicalIF":4.2,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144083687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
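As a baseline for the GLCM feature-extraction step, the sketch below computes standard GLCM texture features with scikit-image; the paper's modifications to the GLCM are not reproduced, and the random array merely stands in for a knee X-ray slice.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(image, distances=(1,), angles=(0, np.pi / 2)):
    """Sketch of standard GLCM texture features of the kind used for the
    classification step; the 'modified' GLCM details are not reproduced."""
    img = np.asarray(image, dtype=np.uint8)
    glcm = graycomatrix(img, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    # Average each property over all distance/angle pairs.
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy", "correlation")}

xray = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in for an X-ray slice
print(glcm_features(xray))
```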
{"title":"M3IF-NSST-MTV: Modified Total variation-based multi-modal medical image fusion using Laplacian energy and morphology in the NSST domain","authors":"Dev Kumar Chaudhary , Prabhishek Singh , Achyut Shankar , Manoj Diwakar","doi":"10.1016/j.imavis.2025.105581","DOIUrl":"10.1016/j.imavis.2025.105581","url":null,"abstract":"<div><div>This paper presents a new multi-modal medical image fusion (M3IF) technique that fuses the medical images obtained from different medical imaging modalities, such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Single Photon Emission Computed Tomography (SPECT) or Positron Emission Tomography (PET), into a single image. This single image is enhanced and contains all the important information of the source images. This paper proposes a hybrid M3IF technique, i.e., M3IF-NSST-MTV, where input medical images are decomposed using Non-Subsampled Shearlet Transform (NSST). It decomposes the image into low frequency coefficients (LFCs), and high frequency coefficients (HFCs). The LFCs are fused using Laplacian energy, and HFCs are fused using morphology. The fused image obtained after applying inverse-NSST is directed to the modified Total Variation (TV), that refines the NSST output. This modified TV output is again fused with NSST output using Feature Similarity Index Measure (FSIM) with Correlation Coefficient (CC)-based threshold value. This modified TV refinement process is iterative process. The results of M3IF-NSST-MTV are evaluated at the pre-set number of iterations = 200. The final fusion results of M3IF-NSST-MTV are compared with some of the prevalent non-traditional methods and based on visual quality and quantitative metric-based analysis; it is found that the M3IF-NSST-MTV delivers better results than all the compared methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105581"},"PeriodicalIF":4.2,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143948162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Weakly supervised camouflaged object detection based on the SAM model and mask guidance","authors":"Xia Li, Xinran Liu, Lin Qi, Junyu Dong","doi":"10.1016/j.imavis.2025.105571","DOIUrl":"10.1016/j.imavis.2025.105571","url":null,"abstract":"<div><div>Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce the missing detection, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"159 ","pages":"Article 105571"},"PeriodicalIF":4.2,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143948241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}