Computer Vision and Image Understanding最新文献_第6页

S2DNet: A self-supervised deraining network using monocular videos S2DNet：一个使用单目视频的自监督训练网络

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-10 DOI: 10.1016/j.cviu.2025.104444

Aman Kumar, Aditya Mohan, A.N. Rajagopalan

{"title":"S2DNet: A self-supervised deraining network using monocular videos","authors":"Aman Kumar, Aditya Mohan, A.N. Rajagopalan","doi":"10.1016/j.cviu.2025.104444","DOIUrl":"10.1016/j.cviu.2025.104444","url":null,"abstract":"<div><div>Rainy conditions degrade the visual quality of images, thus presenting significant challenges for various vision-based downstream tasks. Traditional deraining approaches often rely on supervised learning methods requiring large, paired datasets of rainy and clean images. However, due to the dynamic and complex nature of rain, compiling such datasets is challenging and often insufficient for training robust models. As a result, researchers often resort to synthetic datasets. However, synthetic datasets have limitations because they often lack realism, can introduce biases, and seldom capture the diversity of real rain scenes. We propose a self-supervised method for image deraining using monocular videos that leverages the fact that rain moves spatially across frames, independently of the static elements in a scene, thus enabling isolation of rain-affected regions. We utilize depth information from the target frame and the camera’s relative pose (translations and rotations) across frames to achieve scene alignment. We apply a view-synthesis constraint that warps features from adjacent frames to the target frame, which enables us to generate pseudo-ground truth images by selecting clean pixels from the warped frame. The pseudo-clean images thus generated are effectively leveraged by our network to remove rain from images in a self-supervised manner without the need for a real rain paired dataset which is difficult to capture. Extensive evaluations on diverse real-world rainy datasets demonstrate that our approach achieves state-of-the-art performance in real image deraining, outperforming existing unsupervised methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104444"},"PeriodicalIF":4.3,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144604598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning 多模态联邦学习的视觉感知潜在提示生成

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-08 DOI: 10.1016/j.cviu.2025.104442

Hao Pan , Xiaoli Zhao , Yuchen Jiang , Lipeng He , Bingquan Wang , Yincan Shu

{"title":"FedVLP: Visual-aware latent prompt generation for Multimodal Federated Learning","authors":"Hao Pan , Xiaoli Zhao , Yuchen Jiang , Lipeng He , Bingquan Wang , Yincan Shu","doi":"10.1016/j.cviu.2025.104442","DOIUrl":"10.1016/j.cviu.2025.104442","url":null,"abstract":"<div><div>Recent studies indicate that prompt learning based on CLIP-like models excels in a variety of image recognition and detection tasks, consequently, it has been applied in Multimodal Federated Learning (MMFL). Federated Prompt Learning (FPL), as a technical branch of MMFL, enables clients and servers to exchange prompts rather than model parameters during communication to address challenges such as data heterogeneity and high training costs. Many existing FPL methods rely heavily on pre-trained visual-language models, making it difficult for them to handle new and real specialized domain data. To further boost the generalization ability of FPL without compromising the personalization of clients, we propose a novel framework that generates prompts guided by visual semantics to better handle specialized and small-scale data. In our approach, each client generates visual-aware latent prompts using a Fusion Encoder and an IE-Module, enabling the learning of fine-grained knowledge. Through federated computation, clients collaboratively maintain a global prompt, allowing the learning of coarse-grained knowledge. FedVLP removes the dependency on manually designed prompt templates and demonstrates superior performance across seven datasets, including CIFAR-10, CIFAR-100, Caltech-101, FLIndustry-100, and others.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104442"},"PeriodicalIF":4.3,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144595957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PDCNet: A lightweight and efficient robotic grasp detection framework via Partial Convolution and knowledge distillation PDCNet：一个基于部分卷积和知识蒸馏的轻量级、高效的机器人抓取检测框架

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-07 DOI: 10.1016/j.cviu.2025.104441

Yanshu Jiang, Yanze Fang, Liwei Deng

{"title":"PDCNet: A lightweight and efficient robotic grasp detection framework via Partial Convolution and knowledge distillation","authors":"Yanshu Jiang, Yanze Fang, Liwei Deng","doi":"10.1016/j.cviu.2025.104441","DOIUrl":"10.1016/j.cviu.2025.104441","url":null,"abstract":"<div><div>Improving detection accuracy complicates robotic grasp models, which makes deploying them on resource-constrained edge AI devices more challenging. Although various lightweight strategies have been proposed, directly designing compact networks may not be optimal, as balancing accuracy and model size is challenging. This paper proposes a lightweight grasp detection framework, PDCNet. In response to this problem, we optimize the interplay between computational demands and detection performance. The method integrates Partial Convolution (PConv) for efficient feature extraction, Discrete Wavelet Transform (DWT) for enhancing frequency-domain feature representation, and a Cross-Stage Fusion (CSF) strategy for optimizing the utilization of multi-scale features. A Quality-Enhanced Huber Loss Function (Q-Huber) is also introduced to improve the network’s sensitivity to vital grasp localities. Finally, the teacher–student framework distills expertise into a compact student model. Comprehensive evaluations were conducted using the public datasets to demonstrate that PDCNet achieves detection accuracies of 98.7%, 95.8%, and 97.1% on Cornell, Jacquard and Jacquard_V2 datasets respectively, while maintaining minimal parameters and high computational efficiency. Real-world experiments on an embedded edge AI device further validate the capability of PDCNet to perform accurate grasp detection under limited computational resources.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104441"},"PeriodicalIF":4.3,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144604599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A method for absolute pose regression based on cascaded attention modules 基于级联注意模块的绝对姿态回归方法

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-05 DOI: 10.1016/j.cviu.2025.104440

Xiaogang Song , Junjie Tang , Kaixuan Yang , Weixuan Guo , Xiaofeng Lu , Xinhong Hei

{"title":"A method for absolute pose regression based on cascaded attention modules","authors":"Xiaogang Song , Junjie Tang , Kaixuan Yang , Weixuan Guo , Xiaofeng Lu , Xinhong Hei","doi":"10.1016/j.cviu.2025.104440","DOIUrl":"10.1016/j.cviu.2025.104440","url":null,"abstract":"<div><div>The absolute camera pose regression estimates the position and orientation of the camera solely based on captured RGB images. However, current single-image techniques often lack robustness, resulting in significant outliers. To address the issues of pose regressors in repetitive textures and dynamic blur scenarios, this paper proposes an absolute pose regression method based on cascaded attention modules. This network integrates global and local information through cascaded attention modules and then employs a dual-stream attention module to reduce the impact of dynamic objects and lighting changes on localization performance by constructing dual-channel dependencies. Specifically, the cascaded attention modules guide the model to focus on the relationships between global and local features and establish long-range channel dependencies, enabling the network to learn richer multi-scale feature representations. Additionally, a dual-stream attention module is introduced to further enhance feature representation by closely associating spatial and channel dimensions. This method is evaluated and analyzed on various indoor and outdoor datasets, with our method reducing the median position error and orientation error to 0.19 m/<span><math><mrow><mn>7</mn><mo>.</mo><mn>44</mn><mo>°</mo></mrow></math></span> on 7-Scenes and 7.09 m/<span><math><mrow><mn>1</mn><mo>.</mo><mn>45</mn><mo>°</mo></mrow></math></span> on RobotCar, demonstrating that the proposed method can significantly improve localization performance. Ablation studies on multiple categories further verify the effectiveness of the proposed modules.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104440"},"PeriodicalIF":4.3,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144595956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AdaptDiff: Adaptive diffusion learning for low-light image enhancement AdaptDiff：用于弱光图像增强的自适应扩散学习

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-04 DOI: 10.1016/j.cviu.2025.104439

Xiaotao Shao , Guipeng Zhang , Yan Shen , Boyu Zhang , Zhongli Wang , Yanlong Sun

{"title":"AdaptDiff: Adaptive diffusion learning for low-light image enhancement","authors":"Xiaotao Shao , Guipeng Zhang , Yan Shen , Boyu Zhang , Zhongli Wang , Yanlong Sun","doi":"10.1016/j.cviu.2025.104439","DOIUrl":"10.1016/j.cviu.2025.104439","url":null,"abstract":"<div><div>Recovering details obscured by noise from low-light images is a challenging task. Recent diffusion models have achieved relatively promising results in low-level vision tasks. However, there are still two issues: (1) under non-uniform illumination conditions, the low-light image cannot be restored with high quality, and (2) the models have limited generalization capabilities. To solve these problems, this paper proposes an Adaptive Enhancement Algorithm guided by a Multi-scale Structural Diffusion (AdaptDiff). AdaptDiff employs adaptive high-order mapping curves (AHMC) for pixel-by-pixel mapping of the image during the diffusion process, thereby adjusting the brightness levels between different regions within the image. In addition, a multi-scale structural guidance approach (MSGD) is proposed as an implicit bias, informing the intermediate layers of the model about the structural characteristics of the image, facilitating more effective restoration of clear images. Guiding the diffusion direction through structural information is conducive to maintaining good performance of the model even when faced with data that it has not previously encountered. Extensive experiments on popular benchmarks show that AdaptDiff achieves superior performance and efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104439"},"PeriodicalIF":4.3,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144595955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Distribution-aware contrastive learning for domain adaptation in 3D LiDAR segmentation 三维激光雷达分割中区域自适应的分布感知对比学习

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-03 DOI: 10.1016/j.cviu.2025.104438

Lamiae El Mendili, Sylvie Daniel, Thierry Badard

引用次数: 0

Continuous hand gesture recognition: Benchmarks and methods 连续手势识别：基准和方法

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-02 DOI: 10.1016/j.cviu.2025.104435

Marco Emporio , Amirpouya Ghasemaghaei , Joseph J. Laviola Jr. , Andrea Giachetti

引用次数: 0

Rethinking the sparse mask learning mechanism in sparse convolution for object detection on drone images 无人机图像稀疏卷积中稀疏掩模学习机制的再思考

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-01 DOI: 10.1016/j.cviu.2025.104432

Yixuan Li , Pengnian Wu , Meng Zhang

{"title":"Rethinking the sparse mask learning mechanism in sparse convolution for object detection on drone images","authors":"Yixuan Li , Pengnian Wu , Meng Zhang","doi":"10.1016/j.cviu.2025.104432","DOIUrl":"10.1016/j.cviu.2025.104432","url":null,"abstract":"<div><div>Although sparse convolutional neural networks have achieved significant progress in fast object detection on high-resolution drone images, the research community has yet to pay enough attention to the great potential of prior knowledge (i.e., local contextual information) in UAV imagery for assisting sparse masks to improve detector performance. Such prior knowledge is beneficial for object detection in complex drone imagery, as tiny objects may be mistakenly detected or even missed entirely without referencing the local context surrounding them. In this paper, we take these priors into account and propose a crucial region learning strategy for sparse masks to boost object detection performance. Specifically, we extend the mask region from the feature region of the objects to their surrounding local context region and introduce a method for selecting and evaluating this local context region. Furthermore, we propose a novel mask-matching constraint to replace the mask activation ratio constraint, thereby enhancing object localization accuracy. We extensively evaluate our method across various detectors on two UAV benchmarks: VisDrone and UAVDT. By leveraging our mask learning strategy, the state-of-the-art sparse convolutional framework achieves higher detection gains with a faster detection speed, demonstrating its significant superiority.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104432"},"PeriodicalIF":4.3,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Cross-graph meta matching correction for noisy graph matching 噪声图匹配的交叉图元匹配校正

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-07-01 DOI: 10.1016/j.cviu.2025.104433

Fangkai Li , Feiyu Pan , Wenjia Meng , Haoliang Sun , Xiushan Nie , Yilong Yin , Xiankai Lu

{"title":"Cross-graph meta matching correction for noisy graph matching","authors":"Fangkai Li , Feiyu Pan , Wenjia Meng , Haoliang Sun , Xiushan Nie , Yilong Yin , Xiankai Lu","doi":"10.1016/j.cviu.2025.104433","DOIUrl":"10.1016/j.cviu.2025.104433","url":null,"abstract":"<div><div>In recent years, significant advancements have been made in image feature point matching within the context of deep graph matching. However, keypoint annotations in images can be inaccurate due to various issues such as occlusion, changes in viewpoint, or poor recognizability, leading to noisy correspondence. To address this limitation, we propose a novel Meta Matching Correction for noisy Graph Matching (MCGM), which introduces meta-learning to mitigate noisy correspondence for the first time. Specifically, we design a Meta Correcting Network (MCN) that integrates global features and geometric consistency information of graphs to generate confidence scores for nodes and edges. Based on the scores, MCN adaptively adjusts and penalizes the noisy assignments, enhancing the model’s ability to handle noisy correspondence. We conduct joint training of the main network and MCN to achieve dynamic correction through a bi-level optimization framework. Experimental evaluations on three public benchmark datasets demonstrate that our proposed method delivers robust performance improvements over state-of-the-art graph matching solutions and exhibits excellent stability when handling images under complex conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104433"},"PeriodicalIF":4.3,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SPKDB-Net: A Salient-Part Pose Keypoints-Based Dual-Branch Network for repetitive action counting SPKDB-Net：一种基于显著部分姿态关键点的重复动作计数双分支网络

IF 4.3 3区计算机科学

Computer Vision and Image Understanding Pub Date : 2025-06-30 DOI: 10.1016/j.cviu.2025.104434

Jinying Wu , Jun Li , Qiming Li

{"title":"SPKDB-Net: A Salient-Part Pose Keypoints-Based Dual-Branch Network for repetitive action counting","authors":"Jinying Wu , Jun Li , Qiming Li","doi":"10.1016/j.cviu.2025.104434","DOIUrl":"10.1016/j.cviu.2025.104434","url":null,"abstract":"<div><div>With the continuous development of deep learning, the field of repetitive action counting is gradually gaining notice from many researchers. Extraction of pose keypoints using human pose estimation networks is proven to be an effective pose-level method. However, the existing pose-level methods have some drawbacks, for example, ignoring the fact that occlusion and unfavourable viewing angles in videos lead to affect the accuracy of pose keypoints extraction. To overcome these problems, we propose a simple but efficient Salient-Part Pose Keypoints-Based Dual-Branch Network (SPKDB-Net). Specifically, we design a dual-branch input channel consisting of a global-based and a salient-part input branch. The global-based input branch is used to input the pose keypoints of the whole body extracted by the human pose estimation network, and the salient-part input branch is used to input the salient-part pose keypoints (<em>i.e.</em>, head, shoulders, and hands). The second branch acts as an auxiliary to the first branch, thus effectively addressing the influence of external factors. In addition, we propose a DFEPM-Module that obtains long-distance dependency between pose keypoints through the attention mechanism, and obtains salient local features fused by the attention mechanism through convolution. Eventually, extensive experiments on the challenging RepCount-pose, UCFRep-pose and Countix-Fitness-pose benchmarks show that our proposed SPKDB-Net achieves state-of-the-art performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104434"},"PeriodicalIF":4.3,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144522146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0