{"title":"PMDNet: A multi-stage approach to single image dehazing with contextual and spatial feature preservation","authors":"D. Pushpalatha, P. Prithvi","doi":"10.1016/j.jvcir.2024.104379","DOIUrl":"10.1016/j.jvcir.2024.104379","url":null,"abstract":"<div><div>Hazy images suffer from degraded contrast and visibility due to atmospheric factors, affecting the accuracy of object detection in computer vision tasks. To address this, we propose a novel Progressive Multiscale Dehazing Network (PMDNet) for restoring the original quality of hazy images. Our network aims to balance high-level contextual information and spatial details effectively during the image recovery process. PMDNet employs a multi-stage architecture that gradually learns to remove haze by breaking down the dehazing process into manageable steps. Starting with a U-Net encoder-decoder to capture high-level context, PMDNet integrates a subnetwork to preserve local feature details. A SAN reweights features at each stage, ensuring smooth information transfer and preventing loss through cross-connections. Extensive experiments on datasets like RESIDE, I-HAZE, O-HAZE, D-HAZE, REAL-HAZE48, RTTS and Forest datasets, demonstrate the robustness of PMDNet, achieving strong qualitative and quantitative results.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104379"},"PeriodicalIF":2.6,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight gesture recognition network","authors":"Jinzhao Guo, Xuemei Lei, Bo Li","doi":"10.1016/j.jvcir.2024.104362","DOIUrl":"10.1016/j.jvcir.2024.104362","url":null,"abstract":"<div><div>As one of the main human–computer interaction methods, gesture recognition has an urgent issue to be addressed, which huge paramaters and massive computation of the classification and recognition algorithm cause high cost in practical applications. To reduce cost and enhance the detection efficiency, a lightweight model of gesture recognition algorithms is proposed in this paper, based on the YOLOv5s framework. Firstly, we adopt ShuffleNetV2 as the backbone network to reduce the computational load and enhance the model’s detection speed. Additionally, lightweight modules such as GSConv and VoVGSCSP are introduced into the neck network to further compress the model size while maintaining accuracy. Furthermore, the BiFPN (Bi-directional Feature Pyramid Network) structure is incorporated to enhance the network’s detection accuracy at a lower computational cost. Lastly, we introduce the Coordinate Attention (CA) mechanism to enhance the network’s focus on key features. To investigate the rationale behind the introduction of the CA attention mechanism and the BiFPN network structure, we analyze the extracted features and validate the network’s attention on different parts of the feature maps through visualization. Experimental results demonstrate that the proposed algorithm achieves an average precision of 95.2% on the HD-HaGRID dataset. Compared to the original YOLOv5s model, the proposal model reduces the parameter count by 70.6% and the model size by 69.2%. Therefore, this model is suitable for real-time gesture recognition classification and detection, demonstrating significant potential for practical applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104362"},"PeriodicalIF":2.6,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing industrial anomaly detection with Mamba-inspired feature fusion","authors":"Mingjing Pei , Xiancun Zhou , Yourui Huang , Fenghui Zhang , Mingli Pei , Yadong Yang , Shijian Zheng , Mai Xin","doi":"10.1016/j.jvcir.2024.104368","DOIUrl":"10.1016/j.jvcir.2024.104368","url":null,"abstract":"<div><div>Image anomaly detection is crucial in industrial applications, with significant research value and practical application potential. Despite recent advancements using image segmentation techniques, challenges remain in global feature extraction, computational complexity, and pixel-level anomaly localization. A scheme is designed to address the issues above. First, the Mamba concept is introduced to enhance global feature extraction while reducing computational complexity. This dual benefit optimizes performance in both aspects. Second, an effective feature fusion module is designed to integrate low-level information into high-level features, improving segmentation accuracy by enabling more precise decoding. The proposed model was evaluated on three datasets, including MVTec AD, BTAD, and AeBAD, demonstrating superior performance across different types of anomalies. Specifically, on the MVTec AD dataset, our method achieved an average AUROC of 99.1% for image-level anomalies and 98.1% for pixel-level anomalies, including a state-of-the-art (SOTA) result of 100% AUROC in the texture anomaly category. These results demonstrate the effectiveness of our method as a valuable reference for industrial image anomaly detection.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104368"},"PeriodicalIF":2.6,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCMixer: Radar-camera fusion based on vision transformer for robust object detection","authors":"Lindong Wang , Hongya Tuo , Yu Yuan , Henry Leung , Zhongliang Jing","doi":"10.1016/j.jvcir.2024.104367","DOIUrl":"10.1016/j.jvcir.2024.104367","url":null,"abstract":"<div><div>In real-world object detection applications, the camera would be affected by poor lighting conditions, resulting in a deteriorate performance. Millimeter-wave radar and camera have complementary advantages, radar point cloud can help detecting small objects under low light. In this study, we focus on feature-level fusion and propose a novel end-to-end detection network RCMixer. RCMixer mainly includes depth pillar expansion(DPE), hierarchical vision transformer and radar spatial attention (RSA) module. DPE enhances radar projection image according to perspective principle and invariance assumption of adjacent depth; The hierarchical vision transformer backbone alternates the feature extraction of spatial dimension and channel dimension; RSA extracts the radar attention, then it fuses radar and camera features at the late stage. The experiment results on nuScenes dataset show that the accuracy of RCMixer exceeds all comparison networks and its detection ability of small objects in dark light is better than the camera-only method. In addition, the ablation study demonstrates the effectiveness of our method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104367"},"PeriodicalIF":2.6,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale UAV image stitching based on global registration optimization and graph-cut method","authors":"Zhongxing Wang , Zhizhong Fu , Jin Xu","doi":"10.1016/j.jvcir.2024.104354","DOIUrl":"10.1016/j.jvcir.2024.104354","url":null,"abstract":"<div><div>This paper presents a large-scale unmanned aerial vehicle (UAV) image stitching method based on global registration optimization and the graph-cut technique. To minimize cumulative registration errors in large-scale image stitching, we propose a two-step global registration optimization approach, which includes affine transformation optimization followed by projective transformation optimization. Evenly distributed matching points are used to formulate the objective function for registration optimization, with the optimal affine transformation serving as the initial value for projective transformation optimization. Additionally, a rigid constraint is incorporated as the regularization term for projective transformation optimization to preserve shape and prevent unnatural warping of the aligned images. After global registration, the graph-cut method is employed to blend the aligned images and generate the final mosaic. The proposed method is evaluated on five UAV-captured remote sensing image datasets. Experimental results demonstrate that our approach effectively aligns multiple images and produces high-quality, seamless mosaics.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104354"},"PeriodicalIF":2.6,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic gesture recognition using 3D central difference separable residual LSTM coordinate attention networks","authors":"Lichuan Geng , Jie Chen , Yun Tie , Lin Qi , Chengwu Liang","doi":"10.1016/j.jvcir.2024.104364","DOIUrl":"10.1016/j.jvcir.2024.104364","url":null,"abstract":"<div><div>The area of human–computer interaction has generated considerable interest in dynamic gesture recognition. However, the intrinsic qualities of the gestures themselves, including their flexibility and spatial scale, as well as external factors such as lighting and background, have impeded the improvement of recognition accuracy. To address this, we present a novel end-to-end recognition network named 3D Central Difference Separable Residual Long Short-Term Memory (LSTM) Coordinate Attention (3D CRLCA) in this paper. The network is composed of three components: (1) 3D Central Difference Separable Convolution (3D CDSC), (2) a residual module to enhance the network’s capability to distinguish between categories, and (3) an LSTM-Coordinate Attention (LSTM-CA) module to direct the network’s attention to the gesture region and its temporal and spatial characteristics. Our experiments using the ChaLearn Large-scale Gesture Recognition Dataset (IsoGD) and IPN datasets demonstrate the effectiveness of our approach, surpassing other existing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104364"},"PeriodicalIF":2.6,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSFFT-Net: A multi-scale feature fusion transformer network for underwater image enhancement","authors":"Zeju Wu , Kaiming Chen , Panxin Ji , Haoran Zhao , Xin Sun","doi":"10.1016/j.jvcir.2024.104355","DOIUrl":"10.1016/j.jvcir.2024.104355","url":null,"abstract":"<div><div>Due to light attenuation and scattering, underwater images typically experience various levels of degradation. This degradation adversely affect object detection and recognition in underwater imagery. Nevertheless, the methods based on convolutional networks have limitations in capturing long-distance dependencies and the methods based on generative adversarial networks exhibit a poor enhancement effect on local detail features. To address this issue, we propose a Multi-Scale Feature Fusion Transformer Network (MSFFT-Net). We design an Underwater Transformer Feature Extraction Module (UTFEM) for conducting window self-attention calculations via maskless reflection filling, thereby enabling the capture of long-distance dependencies. The Channel Transformer Selective Kernel Fusion module (CTSKF) is devised as a replacement for the skip connection. By employing one-stage multi-scale feature coding recombination and two-stage selective kernel (SK) fusion, the model enhances its focus on local detailed features. Extensive experiments on three publicly available datasets demonstrate that our MSFFT-Net achieves better performance than some well-recognized technologies.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104355"},"PeriodicalIF":2.6,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DA4NeRF: Depth-aware Augmentation technique for Neural Radiance Fields","authors":"Hamed Razavi Khosroshahi , Jaime Sancho , Gun Bang , Gauthier Lafruit , Eduardo Juarez , Mehrdad Teratani","doi":"10.1016/j.jvcir.2024.104365","DOIUrl":"10.1016/j.jvcir.2024.104365","url":null,"abstract":"<div><div>Neural Radiance Fields (NeRF) demonstrate impressive capabilities in rendering novel views of specific scenes by learning an implicit volumetric representation from posed RGB images without any depth information. View synthesis is the computational process of synthesizing novel images of a scene from different viewpoints, based on a set of existing images. One big problem is the need for a large number of images in the training datasets for neural network-based view synthesis frameworks. The challenge of data augmentation for view synthesis applications has not been addressed yet. NeRF models require comprehensive scene coverage in multiple views to accurately estimate radiance and density at any point. In cases without sufficient coverage of scenes with different viewing directions, cannot effectively interpolate or extrapolate unseen scene parts. In this paper, we introduce a new pipeline to tackle this data augmentation problem using depth data. We use MPEG’s Depth Estimation Reference Software and Reference View Synthesizer to add novel non-existent views to the training sets needed for the NeRF framework. Experimental results show that our approach improves the quality of the rendered images using NeRF’s model. The average quality increased by 6.4 dB in terms of Peak Signal-to-Noise Ratio (PSNR), with the highest increase being 11 dB. Our approach not only adds the ability to handle the sparsely captured multiview content to be used in the NeRF framework, but also makes NeRF more accurate and useful for creating high-quality virtual views.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104365"},"PeriodicalIF":2.6,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-aware representations for unsupervised Camouflaged Object Detection","authors":"Zelin Lu, Xing Zhao, Liang Xie, Haoran Liang, Ronghua Liang","doi":"10.1016/j.jvcir.2024.104366","DOIUrl":"10.1016/j.jvcir.2024.104366","url":null,"abstract":"<div><div>Unsupervised image segmentation algorithms face challenges due to the lack of human annotations. They typically employ representations derived from self-supervised models to generate pseudo-labels for supervising model training. Using this strategy, the model’s performance largely depends on the quality of the generated pseudo-labels. In this study, we design an unsupervised framework to perform COD (Camouflaged Object Detection) without the need for generating pseudo-labels. Specifically, we utilize semantic-aware representations, trained in a self-supervised manner on large-scale unlabeled datasets, to guide the training process. These representations not only capturing rich contextual semantic information but also assist in refining the blurred boundaries of camouflaged objects. Furthermore, we design a framework that integrates these semantic-aware representations with task-specific features, enabling the model to perform the UCOD (Unsupervised Camouflaged Object Detection) task with enhanced contextual understanding. Moreover, we introduce an innovative multi-scale token loss function, which maintain the structural integrity of objects at various scales in the model’s predictions through mutual supervision between different features and scales. Extensive experimental validation demonstrates that our model significantly enhances the performance of UCOD, closely approaching the capabilities of state-of-the-art weakly-supervised COD models.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104366"},"PeriodicalIF":2.6,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRGNet: Dual-Relation Graph Network for point cloud analysis","authors":"Ce Zhou, Qiang Ling","doi":"10.1016/j.jvcir.2024.104353","DOIUrl":"10.1016/j.jvcir.2024.104353","url":null,"abstract":"<div><div>Recently point cloud analysis has attracted more and more attention. However, it is a challenging task because point clouds are irregular, sparse, and unordered. To accomplish that task, this paper proposes Dual Relation Convolution (DRConv) which utilizes both geometric relations and feature-level relations to effectively aggregate discriminative features. The geometric relations take the explicit geometric structures to establish the spatial connections in the local regions while the implicit feature-level relations are taken to capture the neighboring points with the same semantic properties. Based on our proposed DRConv, we construct a Dual-Relation Graph Network (DRGNet) for point cloud analysis. To capture long-range contextual information, our DRGNet searches for neighboring points in both 3D geometric space and feature space to effectively aggregate local and distant information. Furthermore, we propose a Channel Attention Block (CAB), which puts more emphasis on important feature channels and effectively captures global information, and can further improve the performance of point cloud segmentation. Extensive experiments on object classification, shape part segmentation, normal estimation, and semantic segmentation tasks demonstrate that our proposed methods can achieve superior performance.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104353"},"PeriodicalIF":2.6,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}