"UAV-based person re-identification: A survey of UAV datasets, approaches, and challenges"
Yousaf Albaluchi, Biying Fu, Naser Damer, Raghavendra Ramachandra, Kiran Raja
Computer Vision and Image Understanding, vol. 251, Article 104261, February 2025. DOI: 10.1016/j.cviu.2024.104261

Abstract: Person re-identification (ReID) has gained significant interest due to growing public safety concerns that require advanced surveillance and identification mechanisms. While most existing ReID research relies on static surveillance cameras, the use of Unmanned Aerial Vehicles (UAVs) for surveillance has recently gained popularity. Noting the promising application of UAVs in ReID, this paper presents a comprehensive overview of UAV-based ReID, highlighting publicly available datasets, key challenges, and methodologies. We summarize and consolidate evaluations conducted across multiple studies, providing a unified perspective on the state of UAV-based ReID research. We underscore the importance of current datasets in advancing UAV-based ReID research despite their limited size and diversity. The survey also lists all available approaches for UAV-based ReID and presents the challenges associated with it, including environmental conditions, image quality issues, and privacy concerns. We discuss dynamic adaptation techniques, multi-model fusion, and lightweight algorithms to leverage ground-based person ReID datasets for UAV applications. Finally, we explore potential research directions, highlighting the need for diverse datasets, lightweight algorithms, and innovative approaches to tackle the unique challenges of UAV-based person ReID.

"MASK_LOSS guided non-end-to-end image denoising network based on multi-attention module with bias rectified linear unit and absolute pooling unit"
Jing Zhang, Jingcheng Yu, Zhicheng Zhang, Congyao Zheng, Yao Le, Yunsong Li
Computer Vision and Image Understanding, vol. 252, Article 104302, February 2025. DOI: 10.1016/j.cviu.2025.104302

Abstract: Deep learning-based image denoising algorithms have demonstrated superior denoising performance but suffer from loss of detail and excessive smoothing of edges after denoising. In addition, these denoising models often involve redundant computation, resulting in low utilization rates and poor generalization. To address these challenges, we propose a Non-end-to-end Multi-Attention Denoising Network (N-ete MADN). First, we propose a Bias Rectified Linear Unit (BReLU) to replace ReLU as the activation function, which provides greater flexibility and an expanded activation range without additional computation, and we use it to construct a Feature Extraction Unit (FEU) with depth-wise convolutions (DConv). We then propose an Absolute Pooling Unit (AbsPooling-unit) used to build a Channel Attention Block (CAB), a Spatial Attention Block (SAB), and a Channel Similarity Attention Block (CSAB), which are integrated into a Multi-Attention Module (MAM). CAB and SAB enhance the model's focus on key information in the channel and spatial dimensions respectively, while CSAB improves the model's ability to detect similar features. The MAM is then used to construct a Multi-Attention Denoising Network (MADN). Finally, we propose a mask loss function (MASK_LOSS) and, based on this loss and MADN, a compound multi-stage denoising network, the Non-end-to-end Multi-Attention Denoising Network (N-ete MADN), which targets images with rich edge information, providing stronger protection for edges and facilitating the reconstruction of edge information after denoising. This framework enhances learning capacity and efficiency, effectively addressing edge detail loss in denoising tasks. Experimental results on several synthetic datasets demonstrate that our model achieves state-of-the-art denoising performance with low computational cost.

{"title":"Collaborative Neural Painting","authors":"Nicola Dall’Asen , Willi Menapace , Elia Peruzzo , Enver Sangineto , Yiming Wang , Elisa Ricci","doi":"10.1016/j.cviu.2025.104298","DOIUrl":"10.1016/j.cviu.2025.104298","url":null,"abstract":"<div><div>The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, <em>Collaborative Neural Painting (CNP)</em>, to facilitate collaborative art painting generation between users and agents. Given any number of user-input <em>brushstrokes</em> as the context or just the desired <em>object class</em>, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, so allowing users’ modifications at any phase until the completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes it easy both editing and composition operations. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism to model the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP and adopt diffusion models as the basic learning process for its effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research. Project page and code: <span><span>this https URL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104298"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Comparing Human Pose Estimation through deep learning approaches: An overview"
Gaetano Dibenedetto, Stefanos Sotiropoulos, Marco Polignano, Giuseppe Cavallo, Pasquale Lops
Computer Vision and Image Understanding, vol. 252, Article 104297, February 2025. DOI: 10.1016/j.cviu.2025.104297

Abstract: In the everyday IoT ecosystem, many devices and systems are interconnected in an intelligent living environment to create a comfortable and efficient living space. In this scenario, approaches based on the automatic recognition of actions and events can support fully autonomous digital assistants and personalized services. A pivotal component in this domain is Human Pose Estimation, which plays a critical role in action recognition for a wide range of applications, including home automation, healthcare, safety, and security. These systems are designed to detect human actions and deliver customized real-time responses and support. Selecting an appropriate technique for Human Pose Estimation is crucial to enhancing these systems for various applications. This choice hinges on the specific environment, and techniques can be categorized by whether they are designed for images or videos, single-person or multi-person scenarios, and monocular or multiview inputs. A comprehensive overview of recent research outcomes is essential to showcase the evolution of the research area, along with its underlying principles and varied application domains. Key benchmarks across these techniques provide valuable insights into their performance; this paper therefore summarizes these benchmarks and offers a comparative analysis of the techniques. As research in this field continues to evolve, it is critical for researchers to stay up to date with the latest developments and methodologies to promote further innovation in pose estimation research. This overview therefore presents a thorough examination of the subject, aiming to equip researchers with the knowledge and resources necessary to investigate the topic and retrieve all information relevant to their investigations.

{"title":"Building extraction from remote sensing images with deep learning: A survey on vision techniques","authors":"Yuan Yuan, Xiaofeng Shi, Junyu Gao","doi":"10.1016/j.cviu.2024.104253","DOIUrl":"10.1016/j.cviu.2024.104253","url":null,"abstract":"<div><div>Building extraction from remote sensing images is a hot topic in the fields of computer vision and remote sensing. In recent years, driven by deep learning, the accuracy of building extraction has been improved significantly. This survey offers a review of recent deep learning-based building extraction methods, systematically covering concepts like representation learning, efficient data utilization, multi-source fusion, and polygonal outputs, which have been rarely addressed in previous surveys comprehensively, thereby complementing existing research. Specifically, we first briefly introduce the relevant preliminaries and the challenges of building extraction with deep learning. Then we construct a systematic and instructive taxonomy from two perspectives: (1) representation and learning-oriented perspective and (2) input and output-oriented perspective. With this taxonomy, the recent building extraction methods are summarized. Furthermore, we introduce the key attributes of extensive publicly available benchmark datasets, the performance of some state-of-the-art models and the free-available products. Finally, we prospect the future research directions from three aspects.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104253"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From bias to balance: Leverage representation learning for bias-free MoCap solving","authors":"Georgios Albanis , Nikolaos Zioulis , Spyridon Thermos , Anargyros Chatzitofis , Kostas Kolomvatsos","doi":"10.1016/j.cviu.2024.104241","DOIUrl":"10.1016/j.cviu.2024.104241","url":null,"abstract":"<div><div>Motion Capture (MoCap) is still dominated by optical MoCap as it remains the gold standard. However, the raw captured data even from such systems suffer from high-frequency noise and errors sourced from ghost or occluded markers. To that end, a post-processing step is often required to clean up the data, which is typically a tedious and time-consuming process. Some studies tried to address these issues in a data-driven manner, leveraging the availability of MoCap data. However, there is a high-level data redundancy in such data, as the motion cycle is usually comprised of similar poses (e.g. standing still). Such redundancies affect the performance of those methods, especially in the rarer poses. In this work, we address the issue of long-tailed data distribution by leveraging representation learning. We introduce a novel technique for imbalanced regression that does not require additional data or labels. Our approach uses a Mahalanobis distance-based method for automatically identifying rare samples and properly reweighting them during training, while at the same time, we employ high-order interpolation algorithms to effectively sample the latent space of a Variational Autoencoder (VAE) to generate new tail samples. We prove that the proposed approach can significantly improve the results, especially in the tail samples, while at the same time is a model-agnostic method and can be applied across various architectures.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104241"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonlocal Gaussian scale mixture modeling for hyperspectral image denoising","authors":"Ling Ding, Qiong Wang, Yin Poo, Xinggan Zhang","doi":"10.1016/j.cviu.2024.104270","DOIUrl":"10.1016/j.cviu.2024.104270","url":null,"abstract":"<div><div>Recent nonlocal sparsity methods have gained significant attention in hyperspectral image (HSI) denoising. These methods leverage the nonlocal self-similarity (NSS) prior to group similar full-band patches into nonlocal full-band groups, followed by enforcing a sparsity constraint, usually through soft-thresholding or hard-thresholding operators, on each nonlocal full-band group. However, in these methods, given that real HSI data are non-stationary and affected by noise, the variances of the sparse coefficients are unknown and challenging to accurately estimate from the degraded HSI, leading to suboptimal denoising performance. In this paper, we propose a novel nonlocal Gaussian scale mixture (NGSM) approach for HSI denoising, which significantly enhances the estimation accuracy of both the variances of the sparse coefficients and the unknown sparse coefficients. To reduce spectral redundancy, a global spectral low-rank (LR) prior is integrated with the NGSM model and consolidated into a variational framework for optimization. Extensive experimental results demonstrate that the proposed NGSM algorithm achieves convincing improvements over many state-of-the-art HSI denoising methods, both in quantitative and visual evaluations, while offering exceptional computational efficiency.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104270"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"ASELMAR: Active and semi-supervised learning-based framework to reduce multi-labeling efforts for activity recognition"
Aydin Saribudak, Sifan Yuan, Chenyang Gao, Waverly V. Gestrich-Thompson, Zachary P. Milestone, Randall S. Burd, Ivan Marsic
Computer Vision and Image Understanding, vol. 251, Article 104269, February 2025. DOI: 10.1016/j.cviu.2024.104269

Abstract: Manual annotation of unlabeled data for model training is expensive and time-consuming, especially for visual datasets that require domain-specific expertise for multi-labeling, such as video records generated in hospital settings. There is a need for frameworks that reduce human labeling effort while improving training performance. Semi-supervised learning is widely used to generate predictions for unlabeled samples in partially labeled datasets. Active learning can be combined with semi-supervised learning to annotate unlabeled samples and reduce the sampling bias introduced by label predictions. We developed the ASELMAR framework, based on active and semi-supervised learning techniques, to reduce the time and effort associated with multi-labeling of unlabeled samples for activity recognition. ASELMAR (i) categorizes the predictions for unlabeled data based on prediction confidence using fixed and adaptive threshold settings, (ii) applies a label verification procedure to samples with ambiguous predictions, and (iii) retrains the model iteratively using samples with either high-confidence predictions or manual annotations. We also designed a software tool to guide domain experts in verifying ambiguous predictions. We applied ASELMAR to recognize eight selected activities from our trauma resuscitation video dataset and evaluated performance based on label verification time and the mean AP score metric. The label verification required by ASELMAR was 12.1% of the manual annotation effort for the unlabeled video records. The improvement in mean AP score was 5.7% for the first iteration and 8.3% for the second iteration with the fixed-threshold method compared to the baseline model. The p-values were below 0.05 for the target activities. Using an adaptive-threshold method, ASELMAR achieved a decrease in AP score deviation, implying an improvement in model robustness. In a speech-based case study, the word error rate decreased by 6.2% and the average transcription factor increased 2.6 times, supporting the broad applicability of ASELMAR in reducing labeling effort for domain experts.

"SSL-Rehab: Assessment of physical rehabilitation exercises through self-supervised learning of 3D skeleton representations"
Ikram Kourbane, Panagiotis Papadakis, Mihai Andries
Computer Vision and Image Understanding, vol. 251, Article 104275, February 2025. DOI: 10.1016/j.cviu.2024.104275

Abstract: Rehabilitation aims to assist individuals in recovering or enhancing functions that have been lost or impaired due to injury, illness, or disease. The automatic assessment of physical rehabilitation exercises offers a valuable method for patient supervision, complementing or potentially substituting traditional clinical evaluations. However, acquiring large-scale annotated datasets presents challenges, prompting the need for self-supervised learning and transfer learning in the rehabilitation domain. Our proposed approach integrates these two strategies through Low-Rank Adaptation (LoRA) for both pretraining and fine-tuning. Specifically, we train a foundation model to learn robust 3D skeleton features that adapt to varying levels of masked motion complexity through a three-stage process. In the first stage, we apply a high masking ratio to a subset of joints, using a transformer-based architecture with a graph embedding layer to capture fundamental motion features. In the second stage, we reduce the masking ratio and expand the model's capacity to learn more intricate motion patterns and interactions between joints. Finally, in the third stage, we further lower the masking ratio to enable the model to refine its understanding of detailed motion dynamics, optimizing its overall performance. During the second and third stages, LoRA layers are incorporated to extract features unique to each masking level, ensuring efficient adaptation without significantly increasing the model size. Fine-tuning for downstream tasks shows that the model performs better when different masked motion levels are utilized. Through extensive experiments conducted on the publicly available KIMORE and UI-PRMD datasets, we demonstrate the effectiveness of our approach in accurately evaluating the execution quality of rehabilitation exercises, surpassing state-of-the-art performance across all metrics. Our project page is available online.

"RelFormer: Advancing contextual relations for transformer-based dense captioning"
Weiqi Jin, Mengxue Qu, Caijuan Shi, Yao Zhao, Yunchao Wei
Computer Vision and Image Understanding, vol. 252, Article 104300, February 2025. DOI: 10.1016/j.cviu.2025.104300

Abstract: Dense captioning aims to detect regions in images and generate natural language descriptions for each identified region. For this task, contextual modeling is crucial for generating accurate descriptions, since regions in an image can interact with each other. Previous efforts primarily focused on modeling the relations between categorized object regions extracted by pre-trained object detectors, e.g., Fast R-CNN. However, they overlook contextual modeling for non-object regions, e.g., sky, rivers, and grass, commonly referred to as "stuff". In this paper, we propose the RelFormer framework to enhance the contextual relation modeling of Transformer-based dense captioning. Specifically, we design a CLIP-assisted region feature extraction module to extract rich contextual features of regions, including stuff regions. We then introduce a straightforward relation encoder based on self-attention to effectively model relationships between regional features. To accurately extract candidate regions in dense images while minimizing redundant proposals, we further introduce amplified decay non-maximum suppression, which amplifies the decay degree of redundant proposals so that they can be removed while preserving the detection of small regions under a low confidence threshold. The experimental results indicate that, by enhancing contextual interactions, our model exhibits a good understanding of regions and attains state-of-the-art performance on dense captioning tasks. Our method achieves 17.52% mAP on VG V1.0, 16.59% on VG V1.2, and 15.49% on VG-COCO. Code is available at https://github.com/Wykay/Relformer.
