{"title":"MambaGait: Gait recognition approach combining explicit representation and implicit state space model","authors":"Haijun Xiong, Bin Feng, Bang Wang, Xinggang Wang, Wenyu Liu","doi":"10.1016/j.imavis.2025.105597","DOIUrl":"10.1016/j.imavis.2025.105597","url":null,"abstract":"<div><div>Gait recognition aims to identify pedestrians based on their unique walking patterns and has gained significant attention due to its wide range of applications. Mamba, a State Space Model, has shown great potential in modeling long sequences. However, its limited ability to capture local details hinders its effectiveness in fine-grained tasks like gait recognition. Moreover, similar to convolutional neural networks and transformers, Mamba primarily relies on implicit learning, which is constrained by the sparsity of binary silhouette sequences. Inspired by explicit feature representations in scene rendering, we introduce a novel gait descriptor, the Explicit Spatial Representation Field (ESF). It represents silhouette images as directed distance fields, enhancing their sensitivity to gait motion and facilitating richer spatiotemporal feature extraction. To further improve Mamba’s ability to capture local details, we propose the Temporal Window Switch Mamba Block (TWSM), which effectively extracts local and global spatiotemporal features via bidirectional temporal window switching. By combining explicit representation and implicit Mamba modeling, MambaGait achieves state-of-the-art performance on four challenging datasets (GREW, Gait3D, CCPG, and SUSTech1K). Code: <span><span>https://github.com/Haijun-Xiong/MambaGait</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105597"},"PeriodicalIF":4.2,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADVC: Adversarial dense video captioning with unsupervised pretraining","authors":"Wangyu Choi , Jiasi Chen , Jongwon Yoon","doi":"10.1016/j.imavis.2025.105595","DOIUrl":"10.1016/j.imavis.2025.105595","url":null,"abstract":"<div><div>Dense video captioning involves detecting and describing events that represent a video story in untrimmed videos using sentences. This task holds great promise for various video analytics-related applications. However, the nondeterministic nature of dense video captioning poses challenges in generating realistic events and captions. Recently, with the advent of large-scale video datasets, pretraining approaches have emerged. Nevertheless, these methods still require strict supervision and often lack accurate localization or are tightly coupled with localization and captioning. To address these challenges, this paper introduces ADVC, a novel approach for dense video captioning that combines unsupervised pre-training and adversarial adaptation. ADVC learns from readily available unlabeled videos and text corpora at scale, thereby reducing the need for strict supervision. It achieves realistic outcomes by directly learning the distribution of human-annotated events and captions through adversarial adaptation. Adversarial adaptation allows for the decoupling of localization and captioning subtasks while effectively considering their interdependence. We evaluate the performance of ADVC using multiple benchmark datasets to showcase the efficacy of our unsupervised pre-training and adversarial adaptation approach.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105595"},"PeriodicalIF":4.2,"publicationDate":"2025-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Similarity verification of kinship pairs using metricized emphasis","authors":"Chhavi Maheshwari , Siddhanth Bhat , Praveen Kumar Shukla , Madhu Oruganti , Vijaypal Singh Dhaka","doi":"10.1016/j.imavis.2025.105619","DOIUrl":"10.1016/j.imavis.2025.105619","url":null,"abstract":"<div><div>Kinship verification is the determination of the validity of biological ties or kinship between two or more individuals, giving insights about genetic trait inheritances and other applications like forensic investigations. This paper presents a deep learning approach to kinship verification that methodically evaluates the similarity between images of kin. The proposed approach, Age-Modified Metricized Filtering (AMMF), begins by augments images via a Cycle-Generative Adversarial Network setup for aging child images, which increases facial parameters and reduces age gap. It then quantifies genetic inheritance by a novel method, Metricized Weight-based Emphasis Filtering, which reconciles facial proportions between the older and younger generation, and then uses Siamese networks for feature embedding and similarity evaluation. The approach is evaluated on a merged dataset of KinFaceW-I and KinFaceW-II, and achieves state-of-the-art performance. The results are suitable for real-world applications, achieving a training accuracy, AUC, and contrastive loss of 97.4%, 0.74 and 0.11 respectively. The approach also achieves 89.41% and 87.86% training accuracy on FIW and TSKinFace datasets respectively. This will contribute toward an accurate determination of the validity of kinship ties, thus contributing to tasks like image management, genealogical research, and criminal investigations.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105619"},"PeriodicalIF":4.2,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boundary-and-object collaborative learning network for camouflaged object detection","authors":"Chenyu Zhuang, Qing Zhang, Chenxi Zhang, Xinxin Yuan","doi":"10.1016/j.imavis.2025.105596","DOIUrl":"10.1016/j.imavis.2025.105596","url":null,"abstract":"<div><div>Existing camouflaged object detection (COD) approaches have achieved remarkable success in detecting and segmenting camouflaged objects that visually blend into the surroundings. However, there are still some challenging and critical issues, including inaccurate localization of target objects with varying scales, and incomplete identification of subtle details. To address these problems, we propose a novel boundary-and-object collaborative learning network (BCLNet) for camouflaged object detection, which simultaneously extracts and progressively refines the position and detail information to ensure segmentation results with uniform interiors and clear boundaries. Specifically, we design the Adaptive Feature Learning (AFL) module to generate the boundary information for identifying the details and the object information for positioning the target objects, and then optimize the two types of features in an interactive learning manner. In this way, the boundary feature and the object feature are able to learn from each other and compensate deficiencies for themselves, thus improving the semantic and detail representation. Moreover, to fully explore the complementarity between the cross-level features, we propose the Boundary-guided Selective Fusion (BSF) module to introduce the boundary cue to help the cross-level feature integration, enriching the semantic information while preserving the detail information. Extensive experimental results demonstrate that our BCLNet outperforms the state-of-the-art COD methods on four widely used datasets. The link to our code and prediction maps are available at <span><span>https://github.com/ZhangQing0329/BCLNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105596"},"PeriodicalIF":4.2,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144230941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic scene graph generation based on an edge dual scene graph and message passing neural network","authors":"Hyeongjin Kim, Byoung Chul Ko","doi":"10.1016/j.imavis.2025.105572","DOIUrl":"10.1016/j.imavis.2025.105572","url":null,"abstract":"<div><div>Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed relationships. To solve these problems, a new approach to the modeling multi-object relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein. EdgeSGG is based on an edge dual scene graph and object-relation centric message passing neural network (OR-MPNN), which can capture rich contextual interactions between unconstrained objects. To facilitate the learning of edge dual scene graphs with a symmetric graph structure, the proposed OR-MPNN learns both object- and relation-centric features for more accurately predicting relation-aware contexts and allows fine-grained relational updates between objects. A comparative experiment with state-of-the-art (SoTA) methods was conducted using two public datasets for SGG operations and six metrics for three subtasks. Compared with SoTA approaches, the proposed model exhibited substantial performance improvements across all SGG subtasks. Furthermore, experiment on imbalanced class distributions revealed that incorporating the relationships between objects effectively mitigates existing long-tail problems. Our code is available at <span><span>https://github.com/Chocolate-Love/EdgeSGG-pytorch</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105572"},"PeriodicalIF":4.2,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144194677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning disturbance-aware correlation filter with adaptive Kaiser window for visual object tracking","authors":"Jianming Zhang , Jiangxin Dai , Wentao Chen , Ke Nai","doi":"10.1016/j.imavis.2025.105585","DOIUrl":"10.1016/j.imavis.2025.105585","url":null,"abstract":"<div><div>Discriminative Correlation Filters (DCF) have been recognized as a classic and effective method in the field of object tracking. In order to mitigate boundary effects, prior DCF-based tracking methods have commonly employed a fixed Hanning window, limiting the adaptability to fluctuations of the response map. Therefore, we propose a disturbance-aware correlation filter with adaptive Kaiser window (DCFAK) for visual object tracking. The adaptive Kaiser window dynamically adjusts its values according to the kurtosis of the response map, effectively suppressing boundary effects. Additionally, to further improve robustness, our DCFAK introduces a disturbance peaks suppression method, which can better distinguish the target object from the objects with similar appearance in the background by attenuating the sub-peaks within the response map. We comprehensively evaluate the performance of our DCFAK on seven datasets, including OTB-2013, OTB, 2015, TC-128, DroneTB, 70, UAV123, UAVDT, and LaSOT. The results demonstrate the superior performance of our method across these datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105585"},"PeriodicalIF":4.2,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144194853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consensus exploration and detail perception for co-salient object detection in optical remote sensing images","authors":"Yanliang Ge , Jiaxue Chen , Taichuan Liang , Yuxi Zhong , Hongbo Bi , Qiao Zhang","doi":"10.1016/j.imavis.2025.105586","DOIUrl":"10.1016/j.imavis.2025.105586","url":null,"abstract":"<div><div>Co-salient object detection (CoSOD) in optical remote sensing images (ORSI) aims to identify common salient objects across a set of related images. To address this, we introduce the first large-scale dataset, CoORSI, comprising 7668 high-quality images annotated with target masks, covering various macroscopic geographic scenes and man-made targets. Furthermore, we propose a novel network, Consensus Exploration and Detail Perception Network (CEDPNet), specifically designed for CoSOD in ORSI. CEDPNet incorporates a Collaborative Object Search Module (COSM) to integrate high-level features and explore collaborative objects, and a Feature Sensing Module (FSM) to enhance salient target perception through difference contrast enhancement and multi-scale detail boosting. By continuously fusing high-level semantic information with low-level detailed features, CEDPNet achieves accurate co-salient object detection. Extensive experiments demonstrate that CEDPNet significantly outperforms state-of-the-art methods on six evaluation metrics, underscoring its effectiveness for CoSOD in ORSI. The CoORSI dataset, model, and results will be publicly available at <span><span>https://github.com/chen000701/CEDPNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"161 ","pages":"Article 105586"},"PeriodicalIF":4.2,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144203555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalizable deepfake detection via Spatial Kernel Selection and Halo Attention Network","authors":"Siyou Guo , Qilei Li , Mingliang Gao , Xianxun Zhu , Imad Rida","doi":"10.1016/j.imavis.2025.105582","DOIUrl":"10.1016/j.imavis.2025.105582","url":null,"abstract":"<div><div>The rapid advancement of AI-Generated Content (AIGC) has enabled the unprecedented synthesis of photorealistic facial images. While these technologies offer transformative potential for creative industries, they also introduce significant risks due to the malicious manipulation of visual media. Current deepfake detection methods struggle with unseen forgeries due to their inability to consider the effects of spatial receptive fields and local representation learning. To bridge these gaps, this paper proposes a Spatial Kernel Selection and Halo Attention Network (SKSHA-Net) for deepfake detection. The proposed model incorporates two key modules, namely Spatial Kernel Selection (SKS) and Halo Attention (HA). The SKS module dynamically adjusts the spatial receptive field to capture subtle artifacts indicative of forgery. The HA module focuses on the intricate relationships between neighboring pixels for local representation learning. Comparative experiments on three public datasets demonstrate that SKSHA-Net outperforms the state-of-the-art (SOTA) methods in both intra-testing and cross-testing.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105582"},"PeriodicalIF":4.2,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Masked Graph Attention network for classification of facial micro-expression","authors":"Ankith Jain Rakesh Kumar, Bir Bhanu","doi":"10.1016/j.imavis.2025.105584","DOIUrl":"10.1016/j.imavis.2025.105584","url":null,"abstract":"<div><div>Facial micro-expressions (MEs) are ultra-fine, quick, and short-motion muscle movements expressing a person’s true feelings. Automatic recognition of MEs with only a few samples is challenging and the extraction of subtle features becomes crucial. This paper addresses these intricacies and presents a novel dual-branch (branch1 for node locations and branch2 for optical flow patch information) masked graph attention network-based approach (MaskGAT) to classify MEs in a video. It utilizes a three-frame graph structure to extract spatio-temporal information. It learns a mask for each node to eliminate the less important node features and propagates the important node features to the neighboring nodes. A masked self-attention graph pooling layer is designed to provide the attention score to eliminate the unwanted nodes and uses only the nodes with a high attention score. An adaptive frame selection mechanism is designed that is based on a sliding window optical flow method to discard the low-intensity emotion frames. A well-designed dual-branch fusion mechanism is developed to extract informative features for the final classification of MEs. Furthermore, the paper presents a detailed mathematical analysis and visualization of the MaskGAT pipeline to demonstrate the effectiveness of node feature masking and pooling. The results are presented and compared with the state-of-the-art methods for SMIC, SAMM, CASME II, and MMEW databases. Further, cross-dataset experiments are carried out, and the results are reported.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105584"},"PeriodicalIF":4.2,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic three-dimensional reconstruction of transparent objects with multiple optimization strategies under limited constraints","authors":"Xiaopeng Sha , Xiaopeng Si , Yujie Zhu , Shuyu Wang , Yuliang Zhao","doi":"10.1016/j.imavis.2025.105580","DOIUrl":"10.1016/j.imavis.2025.105580","url":null,"abstract":"<div><div>Reconstructing transparent objects with limited constraints has long been considered a highly challenging problem. Due to the complex interaction between transparent objects and light, which involves intricate refraction and reflection relationships, traditional three-dimensional (3D) reconstruction methods are less than effective for transparent objects. To address this issue, this study proposes a 3D reconstruction method specifically designed for transparent objects. Incorporating multiple optimization strategies, the method works under limited constraints to achieve the automatic reconstruction of transparent objects with only a few transparent object images in any known environment, without the need for specific data collection devices or environments. The proposed method makes use of automatic image segmentation and modifies the network interface and structure of the PointNeXt algorithm to introduce the TransNeXt network, which enhances normal features, optimizes weight attenuation, and employs a preheating cosine annealing learning rate. We use several steps to reconstruct the complete 3D shape of transparent objects. First, we initialize the transparent shape with a visual hull reconstructed with the contours obtained by the TOM-Net. Then, we construct the normal reconstruction network to estimate the normal values. Finally, we reconstruct the complete 3D shape using the TransNeXt network. Multiple experiments show that the TransNeXt network exhibits superior reconstruction performance to other networks and can effectively perform the automatic reconstruction of transparent objects even under limited constraints.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"160 ","pages":"Article 105580"},"PeriodicalIF":4.2,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144135055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}