Title: Small object detection in aerial traffic imagery: A benchmark for motorbike-dominated road scenes
Authors: Dung Truong, Quang Nguyen, Khanh-Duy Nguyen, Tam V. Nguyen, Khang Nguyen
DOI: 10.1016/j.jvcir.2025.104603
Journal of Visual Communication and Image Representation, Volume 113, Article 104603 (published 2025-10-10).

Abstract: Unmanned Aerial Vehicles (UAVs) have become indispensable for traffic monitoring, urban planning, and disaster management, particularly in high-density traffic environments like those in Southeast Asia. Vietnamese traffic, characterized by its high density of compact vehicles and unconventional patterns, poses unique challenges for object detection systems. Moreover, UAV imagery introduces additional complexities, such as variable object orientations and high-density scenes, which existing algorithms struggle to handle effectively. In this paper, we present two novel UAV datasets, UIT-Drone4 and UIT-Drone7, with 4 and 7 classes, respectively. These datasets encompass diverse environments, from urban traffic to rural roads and market areas, and provide detailed annotations for object orientation. We benchmark ten state-of-the-art object detection methods, including YOLOv8-v11 and orientation-specific approaches such as Oriented RepPoints, SASM, RTMDet, and Rotated Faster R-CNN, to evaluate their performance under real-world conditions. Our results reveal critical limitations in current methods when applied to motorbike-dominated traffic, highlighting challenges such as high object density, complex orientations, and varying environmental conditions. The UIT-Drone4 and UIT-Drone7 datasets are publicly available at UIT-Drone4-Link and UIT-Drone7-Link, respectively.

Title: FPEVO: Fused point-edge visual odometry for low-structured and low-textured scenes
Authors: Dylan Brown, Hans Grobler, Johan Pieter de Villiers
DOI: 10.1016/j.jvcir.2025.104599
Journal of Visual Communication and Image Representation, Volume 112, Article 104599 (published 2025-10-08).

Abstract: Visual odometry is an essential component of vision-based robotic navigation systems. A primary limitation of existing visual odometry solutions is their inability to achieve satisfactory performance in both high- and low-textured regions. In this paper, a robust RGB-D visual odometry method is proposed that fuses point and edge features. By combining the descriptiveness of feature points with the structure provided by edge data, a method that is robust to low-textured scenes is developed. Edge features are first detected and grouped based on the Gestalt principles of continuity and proximity. Edge groups are then associated between the current and previous frames using point features in the vicinity of the edges. Pose estimation is thereafter performed by first matching points between associated edge groups, filtering these points based on structural constraints imposed by the edges, and estimating the motion of the agent. Compared to state-of-the-art alternatives, such as REVO, MSC-VO, DROID-VO and SplaTAM on the TUM RGB-D, ICL-NUIM and Tartan-Air datasets, the resulting method reduces the root mean square absolute trajectory error, and translational and rotational relative pose errors by up to 58%, 75%, and 82%, respectively. This indicates that our method is not only more accurate than current approaches, but also more consistent, especially in low-structured and low-textured environments.

{"title":"Dual-branch manifold information consistency for unsupervised visible–infrared person re-identification","authors":"Yanling Gao , Zhenyu Wang","doi":"10.1016/j.jvcir.2025.104595","DOIUrl":"10.1016/j.jvcir.2025.104595","url":null,"abstract":"<div><div>Unsupervised visible–infrared person re-identification focuses on the challenging task of matching individuals across different spectral modalities without labeled data. However, most existing pipelines construct correspondences exclusively from global representations, making them susceptible to modality-induced distortions that compromise cross-modal identity consistency. Moreover, the prevailing focus on label association often neglects the role of feature organization in preserving intra-class cohesion and inter-class separation, leading to identity dispersion and the erroneous grouping of visually similar but unrelated individuals. To address these limitations, we propose the dual-branch manifold information consistency framework comprising two modules. The first, dual-branch interactive feature enrichment, captures complementary global and region-specific patterns by building graph-based associations among image parts and applying attention-driven global–local interaction. The second, consistency-driven manifold refinement, learns modality-aware neighborhood structures via enhanced neighbor membership matrices and refines the manifold geometry through a globally aware coding rate-based objective and a locally aware cycle consistency constraint. Extensive experiments on popular datasets validate the superiority of our approach, highlighting its potential to significantly advance unsupervised visible–infrared person re-identification.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104595"},"PeriodicalIF":3.1,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145278065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dense hazy image dehazing network with progressive learning paradigm and frequency decoupling enhancement","authors":"Xinlai Guo , Yuzhen Zhang , Yanyun Tao","doi":"10.1016/j.jvcir.2025.104598","DOIUrl":"10.1016/j.jvcir.2025.104598","url":null,"abstract":"<div><div>Dense hazy image dehazing is a challenging task. When processing dense haze images, the multi-layer encoding compression of deep model often leads to the loss of originally high-frequency features. Under traditional supervised learning paradigms, it is difficult to obtain a clear image from a dense hazy one, and the convergence of model training cannot be guaranteed. To address these issues, we propose a novel U-Net-based model with frequency decoupling enhancement (FDE) to dehaze dense hazy images. The FDE decouples the multi-level frequency features of dense hazy images, preserving an image’s primary information and enhancing high-frequency details. The spatial-frequency interaction (SFI) module fuses high-level frequency features with spatial features, effectively making them complement each other. Meanwhile, the noise suppressor (NS) is designed to reduce the high frequency noise derived by FDE. Our progressive learning paradigm draws inspiration from transfer learning, where pretraining is conducted on a simplified version of the complex target task. This approach involves training a generative model to convert dense hazy images into light hazy images, followed by fine-tuning the model’s parameters to adapt to the more complex dense haze removal task. This strategy prevents training collapse during dense haze removal. Experimental results demonstrate that the proposed method achieves favorable subjective and objective performance across various dense hazy image dehazing datasets. The code for this work is available at https://github.com/Paris0703/progressive_dehazing.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104598"},"PeriodicalIF":3.1,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: CrossFlow: Learning cost volumes for optical flow by cross-matching local and non-local image features
Authors: Ziyang Liu, Zimeng Liu, Xingming Wu, Weihai Chen, Zhong Liu, Zhengguo Li
DOI: 10.1016/j.jvcir.2025.104588
Journal of Visual Communication and Image Representation, Volume 112, Article 104588 (published 2025-09-26).

Abstract: Optical flow is the pixel-level correspondence between two consecutive video frames. The cost volume plays an important role in deep learning-based optical flow methods. It measures the dissimilarity and the matching cost between two pixels in consecutive frames. Numerous optical flow methods revolve around the cost volume. Most existing work constructs the cost volume by computing the dot product between the features of the target image and the source images, which are generally extracted by a shared convolutional neural network (CNN). However, these methods cannot adequately address long-standing challenges such as motion blur and large displacements. In this study, we propose CrossFlow, which computes the cost volume by cross-matching local and non-local image features. The local and non-local features are extracted by the CNN and the transformer, respectively. Then, a total of four kinds of cost volumes are computed, and they are fused adaptively through a Softmax layer. As such, the final cost volume contains both the high- and low-frequency information. This helps the network find the correct correspondences from images with motion blur and large displacements. The experimental results demonstrate that our optical flow estimation method outperforms the baseline method (CRAFT) by 7% and 10% on the publicly available benchmarks Sintel and KITTI, respectively, revealing the effectiveness of the proposed cost volume.

{"title":"Learned-MAP-OMP: An unrolled neural network for signal and image denoising","authors":"Pagoti Reshma , Srinivas Tenneti , Pradip Sasmal , Ramunaidu Randhi","doi":"10.1016/j.jvcir.2025.104592","DOIUrl":"10.1016/j.jvcir.2025.104592","url":null,"abstract":"<div><div>Learned Orthogonal Matching Pursuit (L-OMP) has been applied to signal and image denoising tasks. However, under high-noise scenarios, dictionaries generated by L-OMP networks often exhibit high coherence and poor convergence to the true dictionary, as they mimic OMP and fail to select optimal atoms from the learned dictionary. This results in error propagation, degrading L-OMP’s performance in signal denoising tasks. To address this, we propose an unrolled network based on Maximum a posteriori OMP (MAP-OMP), termed Learned-MAP-OMP (L-MAP-OMP). It learns the atoms of the dictionary with the highest MAP likelihood ratios by leveraging the statistical distributions of the measurement matrix, sparse signal, and noise vector. Numerical results demonstrate that dictionaries learned by L-MAP-OMP exhibit improved convergence to the true dictionary, lower coherence, and reduced test Mean Squared Error (MSE) in signal denoising tasks. In particularly, at a noise level of <span><math><mrow><mn>0</mn><mo>.</mo><mn>1</mn></mrow></math></span>, the coherence of the dictionary learned by L-MAP-OMP is <span><math><mrow><mn>0</mn><mo>.</mo><mn>29</mn></mrow></math></span>, while those learned by L-OMP and Learned Iterative Soft Thresholding Algorithm (LISTA) are <span><math><mrow><mn>0</mn><mo>.</mo><mn>98</mn></mrow></math></span> and <span><math><mrow><mn>0</mn><mo>.</mo><mn>48</mn></mrow></math></span>, respectively. Consequently, we observe that L-MAP-OMP achieves a test MSE of approximately <span><math><mrow><mo>−</mo><mn>27</mn></mrow></math></span> dB, outperforming L-OMP and LISTA, which attain test MSE around <span><math><mrow><mo>−</mo><mn>22</mn></mrow></math></span> dB and <span><math><mrow><mo>−</mo><mn>20</mn></mrow></math></span> dB, respectively. Furthermore, in image denoising tasks, L-MAP-OMP showed statistically significant difference (<span><math><mrow><mi>p</mi><mo><</mo><mn>0</mn><mo>.</mo><mn>05</mn></mrow></math></span>) in PSNR and SSIM compared to L-OMP, LISTA, DnCNN, and BM3D. Model selection based on Cohen’s d, mean, and variance further confirmed its superiority<span><math><mo>−</mo></math></span>outperforming LISTA and BM3D , surpassing L-OMP in several noise scenarios, and remaining competitive with DnCNN.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104592"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: CAgMLP: An MLP-like architecture with a Cross-Axis gated token mixer for image classification
Authors: Jielin Jiang, Quan Zhang, Yan Cui, Shun Wei, Yingnan Zhao
DOI: 10.1016/j.jvcir.2025.104590
Journal of Visual Communication and Image Representation, Volume 112, Article 104590 (published 2025-09-25).

Abstract: Recent MLP-based models have employed axial projections to orthogonally decompose the entire space into horizontal and vertical directions, effectively balancing long-range dependencies and computational costs. However, such methods operate independently along the two axes, hindering their ability to capture the image's global spatial structure. In this paper, we propose a novel MLP architecture called Cross-Axis gated MLP (CAgMLP), which consists of two main modules, Cross-Axis Gated Token-Mixing MLP (CGTM) and Convolutional Gated Channel-Mixing MLP (CGCM). CGTM addresses the loss of information from single-dimensional interactions by leveraging a multiplicative gating mechanism that facilitates the cross-fusion of features captured along the two spatial axes, enhancing feature selection and information flow. CGCM improves the dual-branch structure of the multiplicative gating units by projecting the fused low-dimensional input into two high-dimensional feature spaces and introducing non-linear features through element-wise multiplication, further improving the model's expressive ability. Finally, both modules incorporate local token aggregation to compensate for the lack of local inductive bias in traditional MLP models. Experiments conducted on several datasets demonstrate that CAgMLP achieves superior classification performance compared to other state-of-the-art methods, while exhibiting fewer parameters and lower computational complexity.

{"title":"Image forgery localization with sparse reward compensation using curiosity-driven deep reinforcement learning","authors":"Yan Cheng , Xiong Li , Xin Zhang , Chaohong Yang","doi":"10.1016/j.jvcir.2025.104587","DOIUrl":"10.1016/j.jvcir.2025.104587","url":null,"abstract":"<div><div>Advanced editing and deepfakes make image tampering harder to detect, threatening image security, credibility, and personal privacy. To address this challenging issue, we propose a novel end-to-end image forgery localization method, based on the curiosity-driven deep reinforcement learning method with intrinsic reward. The proposed method provides reliable localization results for forged regions in images of various types of forgery. This study designs a new Focal-based reward function that is suitable for scenarios with highly imbalanced numbers of forged and real pixels. Furthermore, considering the issue of sparse rewards caused by sparse forgery regions in real-world forgery scenarios, we introduce a surprise-based intrinsic reward generation module, which guides the agent to explore and learn the optimal strategy. Extensive experiments conducted on multiple benchmark datasets show that the proposed method outperforms other methods in pixel-level forgery localization. Additionally, the proposed method demonstrates stable robustness to image degradation caused by different post-processing attacks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104587"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145157705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure preserving point cloud completion and classification with coarse-to-fine information","authors":"Seema Kumari , Srimanta Mandal , Shanmuganathan Raman","doi":"10.1016/j.jvcir.2025.104591","DOIUrl":"10.1016/j.jvcir.2025.104591","url":null,"abstract":"<div><div>Point clouds are the predominant data structure for representing 3D shapes. However, captured point clouds are often partial due to practical constraints, necessitating point cloud completion. In this paper, we propose a novel deep network architecture that preserves the structure of available points while incorporating coarse-to-fine information to generate dense and consistent point clouds. Our network comprises three sub-networks: Coarse-to-Fine, Structure, and Tail. The Coarse-to-Fine sub-net extracts multi-scale features, while the Structure sub-net utilizes a stacked auto-encoder with weighted skip connections to preserve structural information. The fused features are then processed by the Tail sub-net to produce a dense point cloud. Additionally, we demonstrate the effectiveness of our structure-preserving approach in point cloud classification by proposing a classification architecture based on the Structure sub-net. Experimental results show that our method outperforms existing approaches in both tasks, highlighting the importance of preserving structural information and incorporating coarse-to-fine details.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104591"},"PeriodicalIF":3.1,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145220121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: F-MDM: Rethinking image denoising with a feature map-based Poisson–Gaussian Mixture Diffusion Model
Authors: Bin Wang, Jiajia Hu, Fengyuan Zuo, Junfei Shi, Haiyan Jin
DOI: 10.1016/j.jvcir.2025.104593
Journal of Visual Communication and Image Representation, Volume 112, Article 104593 (published 2025-09-25).

Abstract: In image-denoising tasks, the diffusion model has shown great potential. Usually, the diffusion model uses a real scene's noise-free and clean image dataset as the starting point for diffusion. When the denoising network trained on such a dataset is applied to image denoising in other scenes, its generalization decreases due to changes in scene priors. To improve generalization, we seek a clean image dataset that not only has rich scene priors but also has a certain scene independence. The VGG-16 network is trained on a large number of images. After real scene images are processed through the VGG-16 convolution layers, the resulting shallow feature maps retain scene priors while breaking free from the scene dependency caused by minor details. This paper uses the shallow feature maps of VGG-16 as the clean image dataset for the diffusion model, and the denoising results are surprisingly strong. Furthermore, considering that image noise mainly comprises Gaussian and Poisson components, whereas the classical diffusion model uses Gaussian noise for diffusion, we introduce a novel Poisson–Gaussian noise mixture for the diffusion process to improve the interpretability of the model, and the theoretical derivation is given. Finally, we propose a Poisson–Gaussian Denoising Mixture Diffusion Model based on Feature maps (F-MDM). Experiments demonstrate that our method exhibits excellent generalization ability compared to other advanced algorithms.
