{"title":"Enhancing industrial anomaly detection with Mamba-inspired feature fusion","authors":"Mingjing Pei , Xiancun Zhou , Yourui Huang , Fenghui Zhang , Mingli Pei , Yadong Yang , Shijian Zheng , Mai Xin","doi":"10.1016/j.jvcir.2024.104368","DOIUrl":"10.1016/j.jvcir.2024.104368","url":null,"abstract":"<div><div>Image anomaly detection is crucial in industrial applications, with significant research value and practical application potential. Despite recent advancements using image segmentation techniques, challenges remain in global feature extraction, computational complexity, and pixel-level anomaly localization. A scheme is designed to address the issues above. First, the Mamba concept is introduced to enhance global feature extraction while reducing computational complexity. This dual benefit optimizes performance in both aspects. Second, an effective feature fusion module is designed to integrate low-level information into high-level features, improving segmentation accuracy by enabling more precise decoding. The proposed model was evaluated on three datasets, including MVTec AD, BTAD, and AeBAD, demonstrating superior performance across different types of anomalies. Specifically, on the MVTec AD dataset, our method achieved an average AUROC of 99.1% for image-level anomalies and 98.1% for pixel-level anomalies, including a state-of-the-art (SOTA) result of 100% AUROC in the texture anomaly category. 
These results demonstrate the effectiveness of our method as a valuable reference for industrial image anomaly detection.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104368"},"PeriodicalIF":2.6,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCMixer: Radar-camera fusion based on vision transformer for robust object detection","authors":"Lindong Wang , Hongya Tuo , Yu Yuan , Henry Leung , Zhongliang Jing","doi":"10.1016/j.jvcir.2024.104367","DOIUrl":"10.1016/j.jvcir.2024.104367","url":null,"abstract":"<div><div>In real-world object detection applications, cameras are affected by poor lighting conditions, resulting in degraded performance. Millimeter-wave radar and cameras have complementary advantages: radar point clouds can help detect small objects under low light. In this study, we focus on feature-level fusion and propose a novel end-to-end detection network, RCMixer. RCMixer mainly comprises depth pillar expansion (DPE), a hierarchical vision transformer, and a radar spatial attention (RSA) module. DPE enhances the radar projection image according to the perspective principle and the invariance assumption of adjacent depth; the hierarchical vision transformer backbone alternates feature extraction along the spatial and channel dimensions; RSA extracts the radar attention and then fuses radar and camera features at the late stage. Experimental results on the nuScenes dataset show that the accuracy of RCMixer exceeds that of all comparison networks and that its ability to detect small objects in low light is better than that of the camera-only method. 
In addition, the ablation study demonstrates the effectiveness of our method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104367"},"PeriodicalIF":2.6,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-scale UAV image stitching based on global registration optimization and graph-cut method","authors":"Zhongxing Wang , Zhizhong Fu , Jin Xu","doi":"10.1016/j.jvcir.2024.104354","DOIUrl":"10.1016/j.jvcir.2024.104354","url":null,"abstract":"<div><div>This paper presents a large-scale unmanned aerial vehicle (UAV) image stitching method based on global registration optimization and the graph-cut technique. To minimize cumulative registration errors in large-scale image stitching, we propose a two-step global registration optimization approach, which includes affine transformation optimization followed by projective transformation optimization. Evenly distributed matching points are used to formulate the objective function for registration optimization, with the optimal affine transformation serving as the initial value for projective transformation optimization. Additionally, a rigid constraint is incorporated as the regularization term for projective transformation optimization to preserve shape and prevent unnatural warping of the aligned images. After global registration, the graph-cut method is employed to blend the aligned images and generate the final mosaic. The proposed method is evaluated on five UAV-captured remote sensing image datasets. 
Experimental results demonstrate that our approach effectively aligns multiple images and produces high-quality, seamless mosaics.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104354"},"PeriodicalIF":2.6,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic gesture recognition using 3D central difference separable residual LSTM coordinate attention networks","authors":"Lichuan Geng , Jie Chen , Yun Tie , Lin Qi , Chengwu Liang","doi":"10.1016/j.jvcir.2024.104364","DOIUrl":"10.1016/j.jvcir.2024.104364","url":null,"abstract":"<div><div>The area of human–computer interaction has generated considerable interest in dynamic gesture recognition. However, the intrinsic qualities of the gestures themselves, including their flexibility and spatial scale, as well as external factors such as lighting and background, have impeded the improvement of recognition accuracy. To address this, we present a novel end-to-end recognition network named 3D Central Difference Separable Residual Long Short-Term Memory (LSTM) Coordinate Attention (3D CRLCA) in this paper. The network is composed of three components: (1) 3D Central Difference Separable Convolution (3D CDSC), (2) a residual module to enhance the network’s capability to distinguish between categories, and (3) an LSTM-Coordinate Attention (LSTM-CA) module to direct the network’s attention to the gesture region and its temporal and spatial characteristics. Our experiments using the ChaLearn Large-scale Gesture Recognition Dataset (IsoGD) and IPN datasets demonstrate the effectiveness of our approach, surpassing other existing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104364"},"PeriodicalIF":2.6,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSFFT-Net: A multi-scale feature fusion transformer network for underwater image enhancement","authors":"Zeju Wu , Kaiming Chen , Panxin Ji , Haoran Zhao , Xin Sun","doi":"10.1016/j.jvcir.2024.104355","DOIUrl":"10.1016/j.jvcir.2024.104355","url":null,"abstract":"<div><div>Due to light attenuation and scattering, underwater images typically experience various levels of degradation. This degradation adversely affects object detection and recognition in underwater imagery. However, methods based on convolutional networks have limitations in capturing long-distance dependencies, and methods based on generative adversarial networks exhibit a poor enhancement effect on local detail features. To address these issues, we propose a Multi-Scale Feature Fusion Transformer Network (MSFFT-Net). We design an Underwater Transformer Feature Extraction Module (UTFEM) for conducting window self-attention calculations via maskless reflection filling, thereby enabling the capture of long-distance dependencies. The Channel Transformer Selective Kernel Fusion module (CTSKF) is devised as a replacement for the skip connection. By employing one-stage multi-scale feature coding recombination and two-stage selective kernel (SK) fusion, the model enhances its focus on local detailed features. 
Extensive experiments on three publicly available datasets demonstrate that our MSFFT-Net achieves better performance than some well-recognized technologies.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104355"},"PeriodicalIF":2.6,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DA4NeRF: Depth-aware Augmentation technique for Neural Radiance Fields","authors":"Hamed Razavi Khosroshahi , Jaime Sancho , Gun Bang , Gauthier Lafruit , Eduardo Juarez , Mehrdad Teratani","doi":"10.1016/j.jvcir.2024.104365","DOIUrl":"10.1016/j.jvcir.2024.104365","url":null,"abstract":"<div><div>Neural Radiance Fields (NeRF) demonstrate impressive capabilities in rendering novel views of specific scenes by learning an implicit volumetric representation from posed RGB images without any depth information. View synthesis is the computational process of synthesizing novel images of a scene from different viewpoints, based on a set of existing images. One big problem is the need for a large number of images in the training datasets for neural network-based view synthesis frameworks. The challenge of data augmentation for view synthesis applications has not been addressed yet. NeRF models require comprehensive scene coverage in multiple views to accurately estimate radiance and density at any point. Without sufficient coverage of a scene from different viewing directions, they cannot effectively interpolate or extrapolate unseen scene parts. In this paper, we introduce a new pipeline to tackle this data augmentation problem using depth data. We use MPEG’s Depth Estimation Reference Software and Reference View Synthesizer to add novel non-existent views to the training sets needed for the NeRF framework. Experimental results show that our approach improves the quality of the rendered images using NeRF’s model. The average quality increased by 6.4 dB in terms of Peak Signal-to-Noise Ratio (PSNR), with the highest increase being 11 dB. 
Our approach not only adds the ability to handle the sparsely captured multiview content to be used in the NeRF framework, but also makes NeRF more accurate and useful for creating high-quality virtual views.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104365"},"PeriodicalIF":2.6,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-aware representations for unsupervised Camouflaged Object Detection","authors":"Zelin Lu, Xing Zhao, Liang Xie, Haoran Liang, Ronghua Liang","doi":"10.1016/j.jvcir.2024.104366","DOIUrl":"10.1016/j.jvcir.2024.104366","url":null,"abstract":"<div><div>Unsupervised image segmentation algorithms face challenges due to the lack of human annotations. They typically employ representations derived from self-supervised models to generate pseudo-labels for supervising model training. Using this strategy, the model’s performance largely depends on the quality of the generated pseudo-labels. In this study, we design an unsupervised framework to perform COD (Camouflaged Object Detection) without the need for generating pseudo-labels. Specifically, we utilize semantic-aware representations, trained in a self-supervised manner on large-scale unlabeled datasets, to guide the training process. These representations not only capture rich contextual semantic information but also assist in refining the blurred boundaries of camouflaged objects. Furthermore, we design a framework that integrates these semantic-aware representations with task-specific features, enabling the model to perform the UCOD (Unsupervised Camouflaged Object Detection) task with enhanced contextual understanding. Moreover, we introduce an innovative multi-scale token loss function, which maintains the structural integrity of objects at various scales in the model’s predictions through mutual supervision between different features and scales. 
Extensive experimental validation demonstrates that our model significantly enhances the performance of UCOD, closely approaching the capabilities of state-of-the-art weakly-supervised COD models.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104366"},"PeriodicalIF":2.6,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRGNet: Dual-Relation Graph Network for point cloud analysis","authors":"Ce Zhou, Qiang Ling","doi":"10.1016/j.jvcir.2024.104353","DOIUrl":"10.1016/j.jvcir.2024.104353","url":null,"abstract":"<div><div>Recently point cloud analysis has attracted more and more attention. However, it is a challenging task because point clouds are irregular, sparse, and unordered. To accomplish that task, this paper proposes Dual Relation Convolution (DRConv) which utilizes both geometric relations and feature-level relations to effectively aggregate discriminative features. The geometric relations take the explicit geometric structures to establish the spatial connections in the local regions while the implicit feature-level relations are taken to capture the neighboring points with the same semantic properties. Based on our proposed DRConv, we construct a Dual-Relation Graph Network (DRGNet) for point cloud analysis. To capture long-range contextual information, our DRGNet searches for neighboring points in both 3D geometric space and feature space to effectively aggregate local and distant information. Furthermore, we propose a Channel Attention Block (CAB), which puts more emphasis on important feature channels and effectively captures global information, and can further improve the performance of point cloud segmentation. 
Extensive experiments on object classification, shape part segmentation, normal estimation, and semantic segmentation tasks demonstrate that our proposed methods can achieve superior performance.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104353"},"PeriodicalIF":2.6,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Virtualized three-dimensional reference tables for efficient data embedding","authors":"Wien Hong , Guan-Zhong Su , Wei-Ling Lin , Tung-Shou Chen","doi":"10.1016/j.jvcir.2024.104351","DOIUrl":"10.1016/j.jvcir.2024.104351","url":null,"abstract":"<div><div>Data embedding methods utilizing a three-dimensional reference table (3DRT) modify pixels to embed digits from various bases using the 3DRT. However, the current 3DRT-based methods are constrained to specific bases and necessitate a physical 3DRT for both embedding and extraction processes. This paper introduces a novel approach that constructs the 3DRT using groups of anisotropic cubes to minimize embedding distortion. The 3DRT is virtualized by representing it as a two-coefficient equation, eliminating the need for a physical 3DRT during embedding and extraction. This virtualization significantly reduces computational complexity, enabling embedding and extraction through straightforward calculations. Furthermore, virtualization decreases the storage space required for the 3DRT. Experimental results demonstrate that the proposed method achieves high image quality and embedding capacity. Specifically, at embedding rates of 2 and 3 bits per pixel, the method produces quality scores of 46.99 dB and 40.91 dB, respectively, across 200 test images.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104351"},"PeriodicalIF":2.6,"publicationDate":"2024-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-exposure image fusion using adaptive color dissimilarity and dynamic equalization techniques","authors":"Jishnu C.R., Vishnukumar S.","doi":"10.1016/j.jvcir.2024.104350","DOIUrl":"10.1016/j.jvcir.2024.104350","url":null,"abstract":"<div><div>In the domain of image processing, Multi-Exposure Image Fusion (MEF) emerges as a crucial technique for developing high dynamic range (HDR) representations by fusing sequences of low dynamic range images. Conventional fusion methods often suffer from shortcomings such as detail loss, edge artifacts, and color inconsistencies, thereby compromising the quality of the fused output, which is further diminished with extremely exposed and limited inputs. While there have been a few efforts to conduct fusion on limited and impaired static input images, there has been no exploration into the fusion of dynamic image sets. This paper proposes an effective MEF approach that operates on a minimum of two extremely exposed, limited datasets of both static and dynamic scenes. The approach begins by categorizing input images into under-exposed and over-exposed categories based on lighting levels, subsequently applying tailored exposure correction strategies. Through iterative refinement and selection of the optimally exposed variant, we construct an advanced intermediate stack, upon which fusion is performed by a pyramidal fusion technique. The method relies on adaptive well-exposedness and color gradient to develop weight maps for pyramidal fusion. The initial weights are refined using a Gaussian filter, and this results in the creation of a seamlessly fused image with expanded dynamic range. Additionally, for dynamic imagery, we propose an adaptive color dissimilarity and dynamic equalization to reduce ghosting artifacts. 
Comparative assessments against existing methodologies, both visual and empirical, confirm the superior performance of the proposed model.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104350"},"PeriodicalIF":2.6,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143173438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}