A method for absolute pose regression based on cascaded attention modules
Xiaogang Song, Junjie Tang, Kaixuan Yang, Weixuan Guo, Xiaofeng Lu, Xinhong Hei
Computer Vision and Image Understanding, Volume 259, Article 104440 (published 2025-07-05). DOI: 10.1016/j.cviu.2025.104440

Abstract: Absolute camera pose regression estimates the position and orientation of the camera solely from captured RGB images. However, current single-image techniques often lack robustness, resulting in significant outliers. To address the weaknesses of pose regressors in scenes with repetitive textures and dynamic blur, this paper proposes an absolute pose regression method based on cascaded attention modules. The network integrates global and local information through cascaded attention modules and then employs a dual-stream attention module that reduces the impact of dynamic objects and lighting changes on localization performance by constructing dual-channel dependencies. Specifically, the cascaded attention modules guide the model to focus on the relationships between global and local features and establish long-range channel dependencies, enabling the network to learn richer multi-scale feature representations. Additionally, the dual-stream attention module further enhances feature representation by closely associating the spatial and channel dimensions. The method is evaluated and analyzed on various indoor and outdoor datasets, reducing the median position/orientation error to 0.19 m/7.44° on 7-Scenes and 7.09 m/1.45° on RobotCar and demonstrating that it significantly improves localization performance. Ablation studies on multiple categories further verify the effectiveness of the proposed modules.
AdaptDiff: Adaptive diffusion learning for low-light image enhancement
Xiaotao Shao, Guipeng Zhang, Yan Shen, Boyu Zhang, Zhongli Wang, Yanlong Sun
Computer Vision and Image Understanding, Volume 259, Article 104439 (published 2025-07-04). DOI: 10.1016/j.cviu.2025.104439

Abstract: Recovering details obscured by noise in low-light images is a challenging task. Recent diffusion models have achieved promising results in low-level vision tasks, but two issues remain: (1) under non-uniform illumination, low-light images cannot be restored with high quality, and (2) the models have limited generalization capability. To solve these problems, this paper proposes an Adaptive Enhancement Algorithm guided by Multi-scale Structural Diffusion (AdaptDiff). AdaptDiff employs adaptive high-order mapping curves (AHMC) for pixel-by-pixel mapping of the image during the diffusion process, thereby adjusting brightness levels between different regions of the image. In addition, a multi-scale structural guidance approach (MSGD) is proposed as an implicit bias that informs the intermediate layers of the model about the structural characteristics of the image, facilitating more effective restoration of clear images. Guiding the diffusion direction with structural information helps the model maintain good performance even on data it has not previously encountered. Extensive experiments on popular benchmarks show that AdaptDiff achieves superior performance and efficiency.
{"title":"Distribution-aware contrastive learning for domain adaptation in 3D LiDAR segmentation","authors":"Lamiae El Mendili, Sylvie Daniel, Thierry Badard","doi":"10.1016/j.cviu.2025.104438","DOIUrl":"10.1016/j.cviu.2025.104438","url":null,"abstract":"<div><div>Semantic segmentation of 3D LiDAR point clouds is very important for applications like autonomous driving and digital twins of cities. However, current deep learning models suffer from a significant generalization gap. Unsupervised Domain Adaptation methods have recently emerged to tackle this issue. While domain-invariant feature learning using Maximum Mean Discrepancy has shown promise for images due to its simplicity, its application remains unexplored in outdoor mobile mapping point clouds. Moreover, previous methods do not consider the class information, which can lead to suboptimal adaptation performance. We propose a new approach—Contrastive Maximum Mean Discrepancy—to maximize intra-class domain alignment and minimize inter-class domain discrepancy, and integrate it into a 3D semantic segmentation model for LiDAR point clouds. The evaluation of our method with large-scale UDA datasets shows that it surpasses state-of-the-art UDA approaches for 3D LiDAR point clouds. CMMD is a promising UDA approach with strong potential for point cloud semantic segmentation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104438"},"PeriodicalIF":4.3,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous hand gesture recognition: Benchmarks and methods
Marco Emporio, Amirpouya Ghasemaghaei, Joseph J. LaViola Jr., Andrea Giachetti
Computer Vision and Image Understanding, Volume 259, Article 104435 (published 2025-07-02). DOI: 10.1016/j.cviu.2025.104435

Abstract: In this paper, we review existing benchmarks for continuous gesture recognition, i.e., the online analysis of hand movements over time to detect and recognize meaningful gestures from a specific dictionary. Focusing on human–computer interaction scenarios, we classify these benchmarks by input data type, gesture dictionary, and evaluation metrics. Metrics specific to the continuous recognition task are crucial for understanding how effectively gestures are spotted in real time within input streams. We also discuss the most effective detection and classification methods proposed for these benchmarks. Our findings indicate that the number and quality of publicly available datasets remain limited and that evaluation methodologies for continuous recognition are not yet standardized. These issues highlight the need for new benchmarks that reflect real-world usage conditions and can support the development of best practices in gesture-based interface design.
{"title":"Rethinking the sparse mask learning mechanism in sparse convolution for object detection on drone images","authors":"Yixuan Li , Pengnian Wu , Meng Zhang","doi":"10.1016/j.cviu.2025.104432","DOIUrl":"10.1016/j.cviu.2025.104432","url":null,"abstract":"<div><div>Although sparse convolutional neural networks have achieved significant progress in fast object detection on high-resolution drone images, the research community has yet to pay enough attention to the great potential of prior knowledge (i.e., local contextual information) in UAV imagery for assisting sparse masks to improve detector performance. Such prior knowledge is beneficial for object detection in complex drone imagery, as tiny objects may be mistakenly detected or even missed entirely without referencing the local context surrounding them. In this paper, we take these priors into account and propose a crucial region learning strategy for sparse masks to boost object detection performance. Specifically, we extend the mask region from the feature region of the objects to their surrounding local context region and introduce a method for selecting and evaluating this local context region. Furthermore, we propose a novel mask-matching constraint to replace the mask activation ratio constraint, thereby enhancing object localization accuracy. We extensively evaluate our method across various detectors on two UAV benchmarks: VisDrone and UAVDT. By leveraging our mask learning strategy, the state-of-the-art sparse convolutional framework achieves higher detection gains with a faster detection speed, demonstrating its significant superiority.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104432"},"PeriodicalIF":4.3,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-graph meta matching correction for noisy graph matching
Fangkai Li, Feiyu Pan, Wenjia Meng, Haoliang Sun, Xiushan Nie, Yilong Yin, Xiankai Lu
Computer Vision and Image Understanding, Volume 259, Article 104433 (published 2025-07-01). DOI: 10.1016/j.cviu.2025.104433

Abstract: In recent years, significant advancements have been made in image feature point matching within the context of deep graph matching. However, keypoint annotations in images can be inaccurate due to issues such as occlusion, changes in viewpoint, or poor recognizability, leading to noisy correspondence. To address this limitation, we propose a novel Meta Matching Correction for noisy Graph Matching (MCGM), which introduces meta-learning to mitigate noisy correspondence for the first time. Specifically, we design a Meta Correcting Network (MCN) that integrates global features and geometric consistency information of graphs to generate confidence scores for nodes and edges. Based on these scores, MCN adaptively adjusts and penalizes noisy assignments, enhancing the model's ability to handle noisy correspondence. We jointly train the main network and MCN to achieve dynamic correction through a bi-level optimization framework. Experimental evaluations on three public benchmark datasets demonstrate that our proposed method delivers robust performance improvements over state-of-the-art graph matching solutions and exhibits excellent stability when handling images under complex conditions.
{"title":"SPKDB-Net: A Salient-Part Pose Keypoints-Based Dual-Branch Network for repetitive action counting","authors":"Jinying Wu , Jun Li , Qiming Li","doi":"10.1016/j.cviu.2025.104434","DOIUrl":"10.1016/j.cviu.2025.104434","url":null,"abstract":"<div><div>With the continuous development of deep learning, the field of repetitive action counting is gradually gaining notice from many researchers. Extraction of pose keypoints using human pose estimation networks is proven to be an effective pose-level method. However, the existing pose-level methods have some drawbacks, for example, ignoring the fact that occlusion and unfavourable viewing angles in videos lead to affect the accuracy of pose keypoints extraction. To overcome these problems, we propose a simple but efficient Salient-Part Pose Keypoints-Based Dual-Branch Network (SPKDB-Net). Specifically, we design a dual-branch input channel consisting of a global-based and a salient-part input branch. The global-based input branch is used to input the pose keypoints of the whole body extracted by the human pose estimation network, and the salient-part input branch is used to input the salient-part pose keypoints (<em>i.e.</em>, head, shoulders, and hands). The second branch acts as an auxiliary to the first branch, thus effectively addressing the influence of external factors. In addition, we propose a DFEPM-Module that obtains long-distance dependency between pose keypoints through the attention mechanism, and obtains salient local features fused by the attention mechanism through convolution. Eventually, extensive experiments on the challenging RepCount-pose, UCFRep-pose and Countix-Fitness-pose benchmarks show that our proposed SPKDB-Net achieves state-of-the-art performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104434"},"PeriodicalIF":4.3,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144522146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ERTFNet: Enhanced RGB-T Fusion Network for semantic segmentation by integrating thermal edge features","authors":"Hanqi Yin , Liguo Zhang , Yiming Sun , Guisheng Yin","doi":"10.1016/j.cviu.2025.104421","DOIUrl":"10.1016/j.cviu.2025.104421","url":null,"abstract":"<div><div>Semantic segmentation is crucial for computer vision, especially in the field of autonomous driving. RGB-Thermal (RGB-T) fusion networks enhance semantic segmentation accuracy in road scenes. However, most existing methods employ the same module structure to extract features from both RGB and thermal images, and all the obtained features are subsequently fused, neglecting the unique characteristics of each modality. Nevertheless, the fused thermal features may introduce noise and redundancy into the network, which is capable of segmenting objects well solely using RGB images. As a result, the performance and accuracy of the approach are limited in complex scenarios. To address this problem, a novel method named Enhanced RGB-T Fusion Network (ERTFNet) is proposed by adopting the encoder–decoder design concept. The constructed encoder in ERTFNet can obtain fused features by combining the extracted edge features from thermal images with RGB image features processed by an attention mechanism. Then, the feature map is restored by a general decoder. Additionally, we introduce the spatial edge constraints during the training stage to further enhance the model’s ability to capture image details and improve both prediction accuracy and boundary clarity. Experiments on two public datasets, compared with existing methods, show that the proposed method can obtain more clear visual contours and higher prediction accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104421"},"PeriodicalIF":4.3,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CRDT-based knowledge synchronisation in an Internet of Robotics Things ecosystem for Ambient Assisted Living
José Galeas, Alberto Tudela, Óscar Pons, Juan Pedro Bandera, Antonio Bandera, Pablo Bustos
Computer Vision and Image Understanding, Volume 259, Article 104437 (published 2025-06-26). DOI: 10.1016/j.cviu.2025.104437

Abstract: Integrating IoT and assistive robots in the design of Ambient Assisted Living (AAL) frameworks has proven to be a useful solution for monitoring and assisting elderly people at home. To manage the captured information and assess the person's condition, respond to emergencies, promote physical or cognitive exercises, and so on, these systems can also integrate a Virtual Caregiver (VC). Given the diversity of technologies deployed in such an AAL framework, deciding how to manage knowledge appropriately can be complex. This paper proposes organising the AAL framework as a distributed system, i.e., a collection of autonomous software agents that provide users with a single coherent response. In this distributed system, agents are deployed locally and handle replicas of the knowledge model, so the problem of merging these replicas into a consistent representation arises. The δ-CRDT (delta-state Conflict-free Replicated Data Type) synchronisation mechanism is employed to ensure eventual consistency with low communication overhead. To manage the dynamics of the AAL ecosystem, the δ-CRDT is combined with a publish/subscribe interaction protocol. In this way, the performance of the IoT, the robot, and the VC, through the functionalities that depend on them, adapts efficiently to changes in context. To demonstrate the validity of the proposal, two use cases requiring a collaborative response from the system were designed: the first deals with a possible fall of the user at home, while the second deals with helping the person move small objects around the flat. The measured latency and data-consistency values show that the proposal works satisfactorily.
{"title":"UM-Mamba: An efficient U-network with medical visual state space for medical image segmentation","authors":"Hejian Chen , Qing Liu , Zhongming Fu, Li Liu","doi":"10.1016/j.cviu.2025.104436","DOIUrl":"10.1016/j.cviu.2025.104436","url":null,"abstract":"<div><div>Designing computationally efficient network architectures remains a persistent necessity in medical image segmentation. Lately, State Space Models (SSMs) are emerging in the field of deep learning and gradually becoming effective basic building layers (or blocks) for constructing deep networks. SSMs not only effectively capture long-distance dependencies but also maintain linear computational complexity relative to input sizes. However, the non-sequential structure of 2D images limits its application in visual tasks. To solve this problem, this paper designs a Medical Visual State Space (MVSS) block with 2D Spiral Selective Scanning (SSS2D) module as the core, and constructs a U-shaped medical image segmentation network called UM-Mamba. The SSS2D module traverses the samples through four spiral scanning paths, which makes up for the deficiency of Mamba architecture in the non-sequential structure of 2D images. We conduct experiments on the Kvasir-SEG and ISIC2018 datasets, and achieve the best results in Dice, IoU and MAE by fine-tuning, which proves that UM-Mamba has the leading level in the experimental datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104436"},"PeriodicalIF":4.3,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144513569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}