{"title":"Gaussian Splatting with NeRF-based color and opacity","authors":"Dawid Malarz , Weronika Smolak-Dyżewska , Jacek Tabor , Sławomir Tadeja , Przemysław Spurek","doi":"10.1016/j.cviu.2024.104273","DOIUrl":"10.1016/j.cviu.2024.104273","url":null,"abstract":"<div><div>Neural Radiance Fields (NeRFs) have demonstrated the remarkable potential of neural networks to capture the intricacies of 3D objects. NeRFs excel at producing strikingly sharp novel views of 3D objects by encoding the shape and color information within neural network weights. Recently, numerous generalizations of NeRFs utilizing generative models have emerged, expanding their versatility. In contrast, <em>Gaussian Splatting</em> (GS) offers similar render quality with faster training and inference, as it does not need neural networks to work. It encodes information about the 3D objects in a set of Gaussian distributions that can be rendered in 3D similarly to classical meshes. Unfortunately, GS is difficult to condition since its representation is fully explicit. To mitigate the caveats of both models, we propose a hybrid model, <em>Viewing Direction Gaussian Splatting</em> (VDGS), that uses a GS representation of the 3D object's shape and a NeRF-based encoding of opacity. Our model uses Gaussian distributions with trainable positions (i.e., the means of the Gaussians), shapes (i.e., the covariances of the Gaussians), and opacities, together with a neural network that takes the Gaussian parameters and viewing direction to produce changes in the said opacity. As a result, our model better describes shadows, light reflections, and the transparency of 3D objects without adding additional texture and light components.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104273"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-based Moving Object Segmentation for underwater videos using semi-supervised learning","authors":"Meghna Kapoor , Wieke Prummel , Jhony H. Giraldo , Badri Narayan Subudhi , Anastasia Zakharova , Thierry Bouwmans , Ankur Bansal","doi":"10.1016/j.cviu.2025.104290","DOIUrl":"10.1016/j.cviu.2025.104290","url":null,"abstract":"<div><div>Moving object segmentation (MOS) using passive underwater image processing is an important technology for monitoring marine habitats. It aids marine biologists studying biological oceanography and the associated fields of chemical, physical, and geological oceanography to understand marine organisms. Dynamic backgrounds due to marine organisms like algae and seaweed, and improper illumination of the environment pose challenges in detecting moving objects in the scene. Previous graph-learning methods have shown promising results in MOS, but are mostly limited to terrestrial surface videos such as traffic video surveillance. Traditional object modeling fails in underwater scenes, due to fish shape and color degradation in motion and the lack of extensive underwater datasets for deep-learning models. Therefore, we propose a semi-supervised graph-learning approach (GraphMOS-U) to segment moving objects in underwater environments. Additionally, existing datasets were consolidated to form the proposed Teleost Fish Classification Dataset, specifically designed for fish classification tasks in complex environments to avoid unseen scenes, ensuring the replication of the transfer learning process on a ResNet-50 backbone. GraphMOS-U uses a six-step approach with transfer learning using Mask R-CNN and a ResNet-50 backbone for instance segmentation, followed by feature extraction using optical flow, visual saliency, and texture. After concatenating these features, a <span><math><mi>k</mi></math></span>-NN graph is constructed, and graph node classification is applied to label objects as foreground or background. The foreground nodes are used to reconstruct the segmentation map of the moving object from the scene. Quantitative and qualitative experiments demonstrate that GraphMOS-U outperforms state-of-the-art algorithms, accurately detecting moving objects while preserving fine details. The proposed method enables the use of graph-based MOS algorithms in underwater scenes.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104290"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Illumination-aware and structure-guided transformer for low-light image enhancement","authors":"Guodong Fan , Zishu Yao , Min Gan","doi":"10.1016/j.cviu.2024.104276","DOIUrl":"10.1016/j.cviu.2024.104276","url":null,"abstract":"<div><div>In this paper, we propose a novel illumination-aware and structure-guided transformer that achieves efficient image enhancement by focusing on brightness degradation and precise high-frequency guidance. Specifically, low-light images often contain numerous regions with similar brightness levels but different spatial locations. However, existing attention mechanisms only compute self-attention using channel dimensions or fixed-size spatial blocks, which limits their ability to capture long-range features, making it challenging to achieve satisfactory image restoration quality. At the same time, the details of low-light images are mostly hidden in the darkness. However, existing models often give equal attention to both high-frequency and smooth regions, which makes it difficult to capture the details of deep degradation, resulting in blurry recovered image details. On the one hand, we introduce a dynamic brightness multi-domain self-attention mechanism that selectively focuses on spatial features within dynamic ranges and incorporates frequency domain information. This approach allows the model to capture both local details and global features, restoring global brightness while paying closer attention to regions with similar degradation. On the other hand, we propose a global maximum gradient search strategy to guide the model's attention towards high-frequency detail regions, thereby achieving a more fine-grained restored image. Extensive experiments on various benchmark datasets demonstrate that our method achieves state-of-the-art performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104276"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-domain conditional prior network for water-related optical image enhancement","authors":"Tianyu Wei , Dehuan Zhang , Zongxin He , Rui Zhou , Xiangfu Meng","doi":"10.1016/j.cviu.2024.104251","DOIUrl":"10.1016/j.cviu.2024.104251","url":null,"abstract":"<div><div>Water-related optical image enhancement improves the perception of information for human and machine vision, facilitating the development and utilization of marine resources. Due to the absorption and scattering of light in different water media, water-related optical images typically suffer from color distortion and low contrast. However, existing enhancement methods struggle to accurately simulate the imaging process in real underwater environments. To model and invert the degradation process of water-related optical images, we propose a Multi-domain Conditional Prior Network (MCPN) based on color vector priors and spectrum vector priors for enhancing water-related optical images. MCPN captures color, luminance, and structural priors across different feature spaces, resulting in a lightweight architecture that enhances water-related optical images while preserving critical information fidelity. Specifically, MCPN includes a modulated network and a conditional network comprising two conditional units. The modulated network is a lightweight Convolutional Neural Network responsible for image reconstruction and local feature refinement. To avoid feature loss from multiple extractions, the Gaussian Conditional Unit (GCU) extracts atmospheric light and color shift information from the input image to form color prior vectors. Simultaneously, incorporating the Fast Fourier Transform, the Spectrum Conditional Unit (SCU) extracts scene brightness and structure to form spectrum prior vectors. These prior vectors are embedded into the modulated network to guide the image reconstruction. MCPN utilizes a PAL-based weighted Selective Supervision (PSS) strategy, selectively adjusting learning weights for images with excessive artificial noise. Experimental results demonstrate that MCPN outperforms existing methods, achieving excellent performance on the UIEB dataset. The PSS also shows fine feature matching in downstream applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104251"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"YES: You should Examine Suspect cues for low-light object detection","authors":"Shu Ye , Wenxin Huang , Wenxuan Liu , Liang Chen , Xiao Wang , Xian Zhong","doi":"10.1016/j.cviu.2024.104271","DOIUrl":"10.1016/j.cviu.2024.104271","url":null,"abstract":"<div><div>Object detection in low-light conditions presents substantial challenges, particularly the issue we define as “low-light object-background cheating”. This phenomenon arises from uneven lighting, leading to blurred and inaccurate object edges. Most existing methods focus on basic feature enhancement and addressing the gap between normal-light and synthetic low-light conditions. However, they often overlook the complexities introduced by uneven lighting in real-world environments. To address this, we propose a novel low-light object detection framework, You Examine Suspect (YES), comprising two key components: the Optical Balance Enhancer (OBE) and the Entanglement Attenuation Module (EAM). The OBE emphasizes “balance” by employing techniques such as inverse tone mapping, white balance, and gamma correction to recover details in dark regions while adjusting brightness and contrast without introducing noise. The EAM focuses on “disentanglement” by analyzing both object regions and surrounding areas affected by lighting variations and integrating multi-scale contextual information to clarify ambiguous features. Extensive experiments on the <span>ExDark</span> and <span>Dark Face</span> datasets demonstrate the superior performance of the proposed YES, validating its effectiveness in low-light object detection tasks. The code will be available at <span><span>https://github.com/Regina971/YES</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104271"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to mask and permute visual tokens for Vision Transformer pre-training","authors":"Lorenzo Baraldi , Roberto Amoroso , Marcella Cornia , Lorenzo Baraldi , Andrea Pilzer , Rita Cucchiara","doi":"10.1016/j.cviu.2025.104294","DOIUrl":"10.1016/j.cviu.2025.104294","url":null,"abstract":"<div><div>The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed <span><math><mi>k</mi></math></span>-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. We release an implementation of our code and models at <span><span>https://github.com/aimagelab/MaPeT</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104294"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143097183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building extraction from remote sensing images with deep learning: A survey on vision techniques","authors":"Yuan Yuan, Xiaofeng Shi, Junyu Gao","doi":"10.1016/j.cviu.2024.104253","DOIUrl":"10.1016/j.cviu.2024.104253","url":null,"abstract":"<div><div>Building extraction from remote sensing images is a hot topic in the fields of computer vision and remote sensing. In recent years, driven by deep learning, the accuracy of building extraction has been improved significantly. This survey offers a review of recent deep learning-based building extraction methods, systematically covering concepts like representation learning, efficient data utilization, multi-source fusion, and polygonal outputs, which have rarely been addressed comprehensively in previous surveys, thereby complementing existing research. Specifically, we first briefly introduce the relevant preliminaries and the challenges of building extraction with deep learning. Then we construct a systematic and instructive taxonomy from two perspectives: (1) a representation- and learning-oriented perspective and (2) an input- and output-oriented perspective. With this taxonomy, the recent building extraction methods are summarized. Furthermore, we introduce the key attributes of extensive publicly available benchmark datasets, the performance of some state-of-the-art models, and the freely available products. Finally, we outline future research directions from three aspects.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104253"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From bias to balance: Leverage representation learning for bias-free MoCap solving","authors":"Georgios Albanis , Nikolaos Zioulis , Spyridon Thermos , Anargyros Chatzitofis , Kostas Kolomvatsos","doi":"10.1016/j.cviu.2024.104241","DOIUrl":"10.1016/j.cviu.2024.104241","url":null,"abstract":"<div><div>Motion Capture (MoCap) is still dominated by optical MoCap as it remains the gold standard. However, the raw captured data even from such systems suffer from high-frequency noise and errors sourced from ghost or occluded markers. To that end, a post-processing step is often required to clean up the data, which is typically a tedious and time-consuming process. Some studies have tried to address these issues in a data-driven manner, leveraging the availability of MoCap data. However, there is a high level of data redundancy in such data, as the motion cycle usually comprises similar poses (e.g. standing still). Such redundancies affect the performance of those methods, especially on the rarer poses. In this work, we address the issue of long-tailed data distribution by leveraging representation learning. We introduce a novel technique for imbalanced regression that does not require additional data or labels. Our approach uses a Mahalanobis distance-based method for automatically identifying rare samples and properly reweighting them during training, while at the same time, we employ high-order interpolation algorithms to effectively sample the latent space of a Variational Autoencoder (VAE) to generate new tail samples. We demonstrate that the proposed approach can significantly improve results, especially on the tail samples, while remaining model-agnostic and applicable across various architectures.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104241"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UAV-based person re-identification: A survey of UAV datasets, approaches, and challenges","authors":"Yousaf Albaluchi , Biying Fu , Naser Damer , Raghavendra Ramachandra , Kiran Raja","doi":"10.1016/j.cviu.2024.104261","DOIUrl":"10.1016/j.cviu.2024.104261","url":null,"abstract":"<div><div>Person re-identification (ReID) has gained significant interest due to growing public safety concerns that require advanced surveillance and identification mechanisms. While most existing ReID research relies on static surveillance cameras, the use of Unmanned Aerial Vehicles (UAVs) for surveillance has recently gained popularity. Noting the promising application of UAVs in ReID, this paper presents a comprehensive overview of UAV-based ReID, highlighting publicly available datasets, key challenges, and methodologies. We summarize and consolidate evaluations conducted across multiple studies, providing a unified perspective on the state of UAV-based ReID research. Despite their limited size and diversity, we underscore the importance of current datasets in advancing UAV-based ReID research. The survey also presents a list of all available approaches for UAV-based ReID. It then examines the challenges associated with UAV-based ReID, including environmental conditions, image quality issues, and privacy concerns. We discuss dynamic adaptation techniques, multi-model fusion, and lightweight algorithms to leverage ground-based person ReID datasets for UAV applications. Finally, we explore potential research directions, highlighting the need for diverse datasets, lightweight algorithms, and innovative approaches to tackle the unique challenges of UAV-based person ReID.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104261"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MASK_LOSS guided non-end-to-end image denoising network based on multi-attention module with bias rectified linear unit and absolute pooling unit","authors":"Jing Zhang , Jingcheng Yu , Zhicheng Zhang , Congyao Zheng , Yao Le , Yunsong Li","doi":"10.1016/j.cviu.2025.104302","DOIUrl":"10.1016/j.cviu.2025.104302","url":null,"abstract":"<div><div>Deep learning-based image denoising algorithms have demonstrated superior denoising performance but suffer from loss of details and excessive smoothing of edges after denoising. In addition, these denoising models often involve redundant calculations, resulting in low utilization rates and poor generalization capabilities. To address these challenges, we propose a Non-end-to-end Multi-Attention Denoising Network (N-ete MADN). Firstly, we propose a Bias Rectified Linear Unit (BReLU) to replace ReLU as the activation function, which provides enhanced flexibility and an expanded activation range without additional computation, constructing a Feature Extraction Unit (FEU) with depth-wise convolutions (DConv). Then an Absolute Pooling Unit (AbsPooling-unit) is proposed to construct a Channel Attention Block (CAB), a Spatial Attention Block (SAB), and a Channel Similarity Attention Block (CSAB), which are integrated into a Multi-Attention Module (MAM). CAB and SAB aim to enhance the model's focus on key information in the channel dimension and the spatial dimension, respectively, while CSAB aims to improve the model's ability to detect similar features. Finally, the MAM is utilized to construct a Multi-Attention Denoising Network (MADN). Then a mask loss function (MASK_LOSS) and a compound multi-stage denoising network called the Non-end-to-end Multi-Attention Denoising Network (N-ete MADN), based on this loss and MADN, are proposed, which aim to handle images with rich edge information, providing enhanced protection for edges and facilitating the reconstruction of edge information after image denoising. This framework enhances learning capacity and efficiency, effectively addressing edge detail loss challenges in denoising tasks. Experimental results on several synthetic datasets demonstrate that our model can achieve state-of-the-art denoising performance with low computational costs.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104302"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}