{"title":"Light field salient object detection network based on feature enhancement and mutual attention","authors":"Xi Zhu, Huai Xia, Xucheng Wang, Zhenrong Zheng","doi":"10.1117/1.jei.33.5.053001","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053001","url":null,"abstract":"Light field salient object detection (SOD) is an essential research topic in computer vision, but robust saliency detection in complex scenes is still very challenging. We propose a new method for accurate and robust light field SOD via convolutional neural networks containing feature enhancement modules. First, the light field dataset is extended by geometric transformations such as stretching, cropping, flipping, and rotating. Next, two feature enhancement modules are designed to extract features from RGB images and depth maps, respectively. The obtained feature maps are fed into a two-stream network to train the light field SOD. We propose a mutual attention approach in this process, extracting and fusing features from RGB images and depth maps. Therefore, our network can generate an accurate saliency map from the input light field images after training. The obtained saliency map can provide reliable a priori information for tasks such as semantic segmentation, target recognition, and visual tracking. Experimental results show that the proposed method achieves excellent detection performance in public benchmark datasets and outperforms the state-of-the-art methods. We also verify the generalization and stability of the method in real-world experiments.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"8 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Small space target detection using megapixel resolution CeleX-V camera","authors":"Yuanyuan Lv, Liang Zhou, Zhaohui Liu, Wenlong Qiao, Haiyang Zhang","doi":"10.1117/1.jei.33.5.053002","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053002","url":null,"abstract":"An event camera (EC) is a bioinspired vision sensor with the advantages of a high temporal resolution, high dynamic range, and low latency. Due to the inherent sparsity of space target imaging data, EC becomes an ideal imaging sensor for space target detection. In this work, we conduct detection of small space targets using a CeleX-V camera with a megapixel resolution. We propose a target detection method based on field segmentation, utilizing the event output characteristics of an EC. This method enables real-time monitoring of the spatial positions of space targets within the camera’s field of view. The effectiveness of this approach is validated through experiments involving real-world observations of space targets. Using the proposed method, real-time observation of space targets with a megapixel resolution EC becomes feasible, demonstrating substantial practical potential in the field of space target detection.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"3 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video anomaly detection based on frame memory bank and decoupled asymmetric convolutions","authors":"Min Zhao, Chuanxu Wang, Jiajiong Li, Zitai Jiang","doi":"10.1117/1.jei.33.5.053006","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053006","url":null,"abstract":"Video anomaly detection (VAD) is essential for monitoring systems. The prediction-based methods identify anomalies by comparing differences between the predicted and real frames. We propose an unsupervised VAD method based on frame memory bank (FMB) and decoupled asymmetric convolution (DAConv), which addresses three problems encountered with auto-encoders (AE) in VAD: (1) how to mitigate the noise resulting from jittering between frames, which is ignored; (2) how to alleviate the insufficient utilization of temporal information by traditional two-dimensional (2D) convolution and the burden for more computing resources in three-dimensional (3D) convolution; and (3) how to make full use of normal data to improve the reliability of anomaly discrimination. Specifically, we initially design a separate network to calibrate video frames within the dataset. Second, we design DAConv to extract features from the video, addressing the absence of temporal dimension information in 2D convolutions and the high computational complexity of 3D convolutions. Concurrently, the interval-frame mechanism mitigates the problem of information redundancy caused by data reuse. Finally, we embed an FMB to store features of normal events, amplifying the contrast between normal and abnormal frames. We conduct extensive experiments on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, achieving AUC values of 98.7%, 90.4%, and 74.8%, respectively, which fully demonstrates the rationality and effectiveness of the proposed method.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"105 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-injective scale aggregation network for crowd counting","authors":"Haojie Zou, Yingchun Kuang, Jianqiang Luo, Mingwei Yao, Haoyu Zhou, Sha Yang","doi":"10.1117/1.jei.33.5.053008","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053008","url":null,"abstract":"Crowd counting has gained widespread attention in the fields of public safety management, video surveillance, and emergency response. Currently, background interference and scale variation of the head are still intractable problems. We propose an attention-injective scale aggregation network (ASANet) to cope with the above problems. ASANet consists of three parts: shallow feature attention network (SFAN), multi-level feature aggregation (MLFA) module, and density map generation (DMG) network. SFAN effectively overcomes the noise impact of a cluttered background by cross-injecting the attention module in the truncated VGG16 structure. To fully utilize the multi-scale crowd information embedded in the feature layers at different positions, we densely connect the multi-layer feature maps in the MLFA module to solve the scale variation problem. In addition, to capture large-scale head information, the DMG network introduces successive dilated convolutional layers to further expand the receptive field of the model, thus improving the accuracy of crowd counting. We conduct extensive experiments on five public datasets (ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_QNRF, UCF_CC_50, JHU-Crowd++), and the results show that ASANet outperforms most of the existing methods in terms of counting and at the same time demonstrates satisfactory superiority in dealing with background noise in different scenes.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"94 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Infrared and visible image fusion based on global context network","authors":"Yonghong Li, Yu Shi, Xingcheng Pu, Suqiang Zhang","doi":"10.1117/1.jei.33.5.053016","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053016","url":null,"abstract":"Thermal radiation and texture data from two different sensor types are usually combined in the fusion of infrared and visible images for generating a single image. In recent years, convolutional neural network (CNN) based on deep learning has become the mainstream technology for many infrared and visible image fusion methods, which often extracts shallow features and ignores the role of long-range dependencies in the fusion task. However, due to its local perception characteristics, CNN can only obtain global contextual information by continuously stacking convolutional layers, which leads to low network efficiency and difficulty in optimization. To address this issue, we proposed a global context fusion network (GCFN) to model context using a global attention pool, which adopts a two-stage strategy. First, a GCFN-based autoencoder network is trained for extracting multi-scale local and global contextual features. To effectively incorporate the complementary information of the input image, a dual branch fusion network combining CNN and transformer is designed in the second step. Experimental results on a publicly available dataset demonstrate that the proposed method outperforms nine advanced methods in fusion performance on both subjective and objective metrics.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"23 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142258310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generative object separation in X-ray images","authors":"Xiaolong Zheng, Yu Zhou, Jia Yao, Liang Zheng","doi":"10.1117/1.jei.33.5.053004","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053004","url":null,"abstract":"X-ray imaging is essential for security inspection; nevertheless, the penetrability of X-rays can cause objects within a package to overlap in X-ray images, leading to reduced accuracy in manual inspection and increased difficulty in auxiliary inspection techniques. Existing methods mainly focus on object detection to enhance the detection ability of models for overlapping regions by augmenting image features, including color, texture, and semantic information. However, these approaches do not address the underlying issue of overlap. We propose a novel method for separating overlapping objects in X-ray images from the perspective of image inpainting. Specifically, the separation method involves using a vision transformer (ViT) to construct a generative adversarial network (GAN) model that requires a hand-created trimap as input. In addition, we present an end-to-end approach that integrates Mask Region-based Convolutional Neural Network with the separation network to achieve fully automated separation of overlapping objects. Given the lack of datasets appropriate for training separation networks, we created MaskXray, a collection of X-ray images that includes overlapping images, trimap, and individual object images. Our proposed generative separation network was tested in experiments and demonstrated its ability to accurately separate overlapping objects in X-ray images. These results demonstrate the efficacy of our approach and make significant contributions to the field of X-ray image analysis.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"4 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vis-YOLO: a lightweight and efficient image detector for unmanned aerial vehicle small objects","authors":"Xiangyu Deng, Jiangyong Du","doi":"10.1117/1.jei.33.5.053003","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053003","url":null,"abstract":"Yolo series models are extensive within the domain of object detection. Aiming at the challenge of small object detection, we analyze the limitations of existing detection models and propose a Vis-YOLO object detection algorithm based on YOLOv8s. First, the down-sampling times are reduced to retain more features, and the detection head is replaced to adapt to the small object. Then, deformable convolutional networks are used to improve the C2f module, improving its feature extraction ability. Finally, the separation and enhancement attention module is introduced to the model to give more weight to the useful information. Experiments show that the improved Vis-YOLO model outperforms the YOLOv8s model on the visdrone-2019 dataset. The precision improved by 5.4%, the recall by 6.3%, and the mAP50 by 6.8%. Moreover, Vis-YOLO models are smaller and suitable for mobile deployment. This research provides a new method and idea for small object detection, which has excellent potential application value.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"37 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Double-level deep multi-view collaborative learning for image clustering","authors":"Liang Xiao, Wenzhe Liu","doi":"10.1117/1.jei.33.5.053012","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053012","url":null,"abstract":"Multi-view clustering has garnered significant attention due to its ability to explore shared information from multiple views. Applications of multi-view clustering include image and video analysis, bioinformatics, and social network analysis, in which integrating diverse data sources enhances data understanding and insights. However, existing multi-view models suffer from the following limitations: (1) directly extracting latent representations from raw data using encoders is susceptible to interference from noise and other factors and (2) complementary information among different views is often overlooked, resulting in the loss of crucial unique information from each view. Therefore, we propose a distinctive double-level deep multi-view collaborative learning approach. Our method further processes the latent representations learned by the encoder through multiple layers of perceptrons to obtain richer semantic information. In addition, we introduce dual-path guidance at both the feature and label levels to facilitate the learning of complementary information across different views. Furthermore, we introduce pre-clustering methods to guide mutual learning among different views through pseudo-labels. Experimental results on four image datasets (Caltech-5V, STL10, Cifar10, Cifar100) demonstrate that our method achieves state-of-the-art clustering performance, evaluated using standard metrics, including accuracy, normalized mutual information, and purity. We compare our proposed method with existing clustering algorithms to validate its effectiveness.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"6 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"USDAP: universal source-free domain adaptation based on prompt learning","authors":"Xun Shao, Mingwen Shao, Sijie Chen, Yuanyuan Liu","doi":"10.1117/1.jei.33.5.053015","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053015","url":null,"abstract":"Universal source-free domain adaptation (USFDA) aims to explore transferring domain-consistent knowledge in the presence of domain shift and category shift, without access to a source domain. Existing works mainly rely on prior domain-invariant knowledge provided by the source model, ignoring the significant discrepancy between the source and target domains. However, directly utilizing the source model will generate noisy pseudo-labels on the target domain, resulting in erroneous decision boundaries. To alleviate the aforementioned issue, we propose a two-stage USFDA approach based on prompt learning, named USDAP. Primarily, to reduce domain differences, during the prompt learning stage, we introduce a learnable prompt designed to align the target domain distribution with the source. Furthermore, for more discriminative decision boundaries, in the feature alignment stage, we propose an adaptive global-local clustering strategy. This strategy utilizes one-versus-all clustering globally to separate different categories and neighbor-to-neighbor clustering locally to prevent incorrect pseudo-label assignments at cluster boundaries. Based on the above two-stage method, target data are adapted to the classification network under the prompt’s guidance, forming more compact category clusters, thus achieving excellent migration performance for the model. We conduct experiments on various datasets with diverse category shift scenarios to illustrate the superiority of our USDAP.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"105 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Appearance flow based structure prior guided image inpainting","authors":"Weirong Liu, Zhijun Li, Changhong Shi, Xiongfei Jia, Jie Liu","doi":"10.1117/1.jei.33.5.053011","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053011","url":null,"abstract":"Image inpainting techniques based on deep learning have shown significant improvements by introducing structure priors, but still generate structure distortion or textures fuzzy for large missing areas. This is mainly because series networks have inherent disadvantages: employing unreasonable structural priors will inevitably lead to severe mistakes in the second stage of cascade inpainting framework. To address this issue, an appearance flow-based structure prior (AFSP) guided image inpainting is proposed. In the first stage, a structure generator regards edge-preserved smooth images as global structures of images and then appearance flow warps small-scale features in input and flows to corrupted regions. In the second stage, a texture generator using contextual attention is designed to yield image high-frequency details after obtaining reasonable structure priors. Compared with state-of-the-art approaches, the proposed AFSP achieved visually more realistic results. Compared on the Places2 dataset, the most challenging with 1.8 million high-resolution images of 365 complex scenes, shows that AFSP was 1.1731 dB higher than the average peak signal-to-noise ratio for EdgeConnect.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"7 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}