{"title":"Multi-head attention with reinforcement learning for supervised video summarization","authors":"Bhakti Deepak Kadam, Ashwini Mangesh Deshpande","doi":"10.1117/1.jei.33.5.053010","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053010","url":null,"abstract":"With the substantial surge in available internet video data, the intricate task of video summarization has consistently attracted the computer vision research community to summarize the videos meaningfully. Many recent summarization techniques leverage bidirectional long short-term memory for its proficiency in modeling temporal dependencies. However, its effectiveness is limited to short-duration video clips, typically up to 90 to 100 frames. To address this constraint, the proposed approach incorporates global and local multi-head attention, effectively capturing temporal dependencies at both global and local levels. This enhancement enables parallel computation, thereby improving overall performance for longer videos. This work considers video summarization as a supervised learning task and introduces a deep summarization architecture called multi-head attention with reinforcement learning (MHA-RL). The architecture comprises a pretrained convolutional neural network for extracting features from video frames, along with global and local multi-head attention mechanisms for predicting frame importance scores. Additionally, the network integrates an RL-based regressor network to consider the diversity and representativeness of the generated video summary. Extensive experimentation is conducted on benchmark datasets, such as TVSum and SumMe. The proposed method exhibits improved performance compared to the majority of state-of-the-art summarization techniques, as indicated by both qualitative and quantitative results.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"31 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end multitasking network for smart container product positioning and segmentation","authors":"Wenzhong Shen, Xuejian Cai","doi":"10.1117/1.jei.33.5.053009","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053009","url":null,"abstract":"The current smart cooler’s commodity identification system first locates the item being purchased, followed by feature extraction and matching. However, this method often suffers from inaccuracies due to the presence of background in the detection frame, leading to missed detections and misidentifications. To address these issues, we propose an end-to-end You Only Look Once (YOLO) for detection and segmentation algorithm. In the backbone network, we combine deformable convolution with a channel-to-pixel (C2f) module to enhance the model feature extraction capability. In the neck network, we introduce an optimized feature fusion structure, which is based on the weighted bi-directional feature pyramid. To further enhance the model’s understanding of both global and local context, a triple feature encoding module is employed, seamlessly fusing multi-scale features for improved performance. The convolutional block attention module is connected to the improved C2f module to enhance the network’s attention to the commodity image channel and spatial information. A supplementary segmentation branch is incorporated into the head of the network, allowing it to not only detect targets within the image but also generate precise segmentation masks for each detected object, thereby enhancing its multi-task capabilities. Compared with YOLOv8, for box and mask, the precision increases by 3% and 4.7%, recall increases by 2.8% and 4.7%, and mean average precision (mAP) increases by 4.9% and 14%. The frames per second is 119, which meets the demand for real-time detection. The results of comparative and ablation studies confirm the high accuracy and performance of the proposed algorithm, solidifying its foundation for fine-grained commodity identification.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"27 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient attention-based networks for fire and smoke detection","authors":"Bowei Xiao, Chunman Yan","doi":"10.1117/1.jei.33.5.053014","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053014","url":null,"abstract":"To address limitations in current flame and smoke detection models, including difficulties in handling irregularities, occlusions, large model sizes, and real-time performance issues, this work introduces FS-YOLO, a lightweight attention-based model. FS-YOLO adopts an efficient architecture for feature extraction capable of capturing long-range information, overcoming issues of redundant data and inadequate global feature extraction. The model incorporates squeeze-enhanced-axial-C2f to enhance global information capture without significantly increasing computational demands. Additionally, the improved VoVNet-GSConv-cross stage partial network refines semantic information from higher-level features, reducing missed detections and maintaining a lightweight model. Compared to YOLOv8n, FS-YOLO achieves a 1.4% increase and a 1.0% increase in mAP0.5 and mAP0.5:0.95, respectively, along with a 1.3% improvement in precision and a 1.0% boost in recall. These enhancements make FS-YOLO a promising solution for flame and smoke detection, balancing accuracy and efficiency effectively.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"20 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DTSIDNet: a discrete wavelet and transformer based network for single image denoising","authors":"Cong Hu, Yang Qu, Yuan-Bo Li, Xiao-Jun Wu","doi":"10.1117/1.jei.33.5.053007","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053007","url":null,"abstract":"Recent advancements in transformer architectures have significantly enhanced image-denoising algorithms, surpassing the limitations of traditional convolutional neural networks by more effectively modeling global interactions through advanced attention mechanisms. In the domain of single-image denoising, noise manifests across various scales. This is especially evident in intricate scenarios, necessitating the comprehensive capture of multi-scale information inherent in the image. To solve transformer’s lack of multi-scale image analysis capability, a discrete wavelet and transformer based network (DTSIDNet) is proposed. The network adeptly resolves the inherent limitations of the transformer architecture by integrating the discrete wavelet transform. DTSIDNet independently manages image data at various scales, which greatly improves both adaptability and efficiency in environments with complex noise. The network’s self-attention mechanism dynamically shifts focus among different scales, efficiently capturing an extensive array of image features, thereby significantly enhancing the denoising outcome. This approach not only boosts the precision of denoising but also enhances the utilization of computational resources, striking an optimal balance between efficiency and high performance. Experiments on real-world and synthetic noise scenarios show that DTSIDNet delivers high image quality with low computational demands, indicating its superior performance in denoising tasks with efficient resource use.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"37 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward effective local dimming-driven liquid crystal displays: a deep curve estimation–based adaptive compensation solution","authors":"Tianshan Liu, Kin-Man Lam","doi":"10.1117/1.jei.33.5.053005","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053005","url":null,"abstract":"Local backlight dimming (LBD) is a promising technique for improving the contrast ratio and saving power consumption for liquid crystal displays. LBD consists of two crucial parts, i.e., backlight luminance determination, which locally controls the luminance of each sub-block of the backlight unit (BLU), and pixel compensation, which compensates for the reduction of pixel intensity, to achieve pleasing visual quality. However, the limitations of the current deep learning–based pixel compensation methods come from two aspects. First, it is difficult for a vanilla image-to-image translation strategy to learn the mapping relations between the input image and the compensated image, especially without considering the dimming levels. Second, the extensive model parameters make these methods hard to be deployed in industrial applications. To address these issues, we reformulate pixel compensation as an input-specific curve estimation task. Specifically, a deep lightweight network, namely, the curve estimation network (CENet), takes both the original input image and the dimmed BLUs as input, to estimate a set of high-order curves. Then, these curves are applied iteratively to adjust the intensity of each pixel to obtain a compensated image. Given the determined BLUs, the proposed CENet can be trained in an end-to-end manner. This implies that our proposed method is compatible with any backlight dimming strategies. Extensive evaluation results on the DIVerse 2K (DIV2K) dataset highlight the superiority of the proposed CENet-integrated local dimming framework, in terms of model size and visual quality of displayed content.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"20 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMLoc: spatial multilayer perception-guided camera localization","authors":"Jingyuan Feng, Shengsheng Wang, Haonan Sun","doi":"10.1117/1.jei.33.5.053013","DOIUrl":"https://doi.org/10.1117/1.jei.33.5.053013","url":null,"abstract":"Camera localization is a technique for obtaining the camera’s six degrees of freedom using the camera as a sensor input. It is widely used in augmented reality, autonomous driving, virtual reality, etc. In recent years, with the development of deep-learning technology, absolute pose regression has gained wide attention as an end-to-end learning-based localization method. The typical architecture is constructed by a convolutional backbone and a multilayer perception (MLP) regression header composed of multiple fully connected layers. Typically, the two-dimensional feature maps extracted by the convolutional backbone have to be flattened and passed into the fully connected layer for pose regression. However, this operation will result in the loss of crucial pixel position information carried by the two-dimensional feature map and adversely affect the accuracy of the pose estimation. We propose a parallel structure, termed SMLoc, using a spatial MLP to aggregate position and orientation information from feature maps, respectively, reducing the loss of pixel position information. Our approach achieves superior performance on common indoor and outdoor datasets.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"734 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chaotic multiple-image encryption scheme: a simple and highly efficient solution for diverse applications","authors":"K. Abhimanyu Kumar Patro, Pulkit Singh, Narendra Khatri, Bibhudendra Acharya","doi":"10.1117/1.jei.33.4.043032","DOIUrl":"https://doi.org/10.1117/1.jei.33.4.043032","url":null,"abstract":"A multitude of confidential and personal digital images are commonly stored and transmitted by devices with limited resources. These devices necessitate the implementation of uncomplicated yet highly efficient encryption techniques to safeguard the images. The challenge of designing encryption algorithms for multiple digital images that are simple, secure, and highly efficient is significant. This challenge arises due to the large quantity of images involved and the considerable size and strong inter-pixel associations exhibited by these digital images. We propose a method for efficiently, simply, and securely encrypting multiple images simultaneously using chaotic one-dimensional (1D) maps. Initially, each grayscale image is consolidated into a single, substantial image. Through transpose columnar transposition and bit-XOR diffusion procedures, each block undergoes parallel permutation and diffusion. The incorporation of parallel permutation and diffusion functions accelerates and enhances the performance of the method. In contrast to existing multi-image encryption methods, the proposed approach consistently employs a single 1D chaotic map, rendering the algorithm both software and hardware efficient while maintaining simplicity. The encryption technique adheres to general requirements for simplicity and high efficiency. Security analysis and simulation results demonstrate that the proposed method is straightforward, highly efficient, and effectively enhances the security of cipher images.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"50 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FCCText: frequency-color complementary bistream structure for scene text detection","authors":"Ruiyi Han, Xin Li","doi":"10.1117/1.jei.33.4.043037","DOIUrl":"https://doi.org/10.1117/1.jei.33.4.043037","url":null,"abstract":"Current scene text detection methods mainly employ RGB domain information for text localization, and their performance has not been fully exploited in many challenging scenes. Considering that the RGB features of text and background in complex environments are subtle and more discernible in the frequency domain, we consider that the frequency-domain information can effectively complement the RGB-domain features, collectively enhancing text detection capabilities. To this end, we propose a network with complementary frequency-domain semantic and color features, called the bistream structure, to facilitate text detection in scenes characterized by a wide variety of complex patterns. Our approach utilizes a frequency perception module (FPM) that converts features extracted by the backbone into the frequency domain to enhance the ability to distinguish the text from the complex background, thereby achieving coarse localization of texts. This innovation utilizes frequency-domain features to efficiently reveal text structures obscured by background noise in the RGB domain, resulting in a sharper differentiation between text and background elements in challenging scenarios. Moreover, we propose a complementary correction module that guides the fusion of multi-level RGB features through the coarse localization results, progressively refining the segmentation results to achieve the correction of the frequency domain features. Extensive experiments on the Total-Text, CTW1500, and MSRA-TD500 datasets demonstrate that our method achieves outstanding performance in scene text detection.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"30 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141933511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning","authors":"Qianyao Shi, Wanru Xu, Zhenjiang Miao","doi":"10.1117/1.jei.33.4.043042","DOIUrl":"https://doi.org/10.1117/1.jei.33.4.043042","url":null,"abstract":"Nowadays, we are surrounded by various types of data from different modalities, such as text, images, audio, and video. The existence of this multimodal data provides us with rich information, but it also brings new challenges: how do we effectively utilize this data for accurate classification? This is the main problem faced by multimodal classification tasks. Multimodal classification is an important task that aims to classify data from different modalities. However, due to the different characteristics and structures of data from different modalities, effectively fusing and utilizing them for classification is a challenging problem. To address this issue, we propose a cross-attention contextual transformer with modality-collaborative learning for multimodal classification (CACT-MCL-MMC) to better integrate information from different modalities. On the one hand, existing multimodal fusion methods ignore the intra- and inter-modality relationships, and there is unnoticed information in the modalities, resulting in unsatisfactory classification performance. To address the problem of insufficient interaction of modality information in existing algorithms, we use a cross-attention contextual transformer to capture the contextual relationships within and among modalities to improve the representativeness of the model. On the other hand, due to differences in the quality of information among different modalities, some modalities may have misleading or ambiguous information. Treating each modality equally may result in modality perceptual noise, which reduces the performance of multimodal classification. Therefore, we use modality-collaborative to filter misleading information, alleviate the quality difference of information among modalities, align modality information with high-quality and effective modalities, enhance unimodal information, and obtain more ideal multimodal fusion information to improve the model’s discriminative ability. Our comparative experimental results on two benchmark datasets for image-text classification, CrisisMMD and UPMC Food-101, show that our proposed model outperforms other classification methods and even state-of-the-art (SOTA) multimodal classification methods. Meanwhile, the effectiveness of the cross-attention module, multimodal contextual attention network, and modality-collaborative learning was verified through ablation experiments. In addition, conducting hyper-parameter validation experiments showed that different fusion calculation methods resulted in differences in experimental results. The most effective feature tensor calculation method was found. We also conducted qualitative experiments. Compared with the original model, our proposed model can identify the expected results in the vast majority of cases. The codes are available at https://github.com/KobeBryant8-24-MVP/CACT-MCL-MMC. 
The CrisisMMD is available at https://dataverse.mpisws.org/dataverse/icwsm18, and the UPMC-Food-101 is available at https://vi","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"2 4 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
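The bidirectional cross-attention at the core of such image-text models can be sketched as follows; this is an assumption-laden illustration, not the released CACT-MCL-MMC code. Text tokens attend to image tokens and vice versa, and the pooled views are fused for classification; token counts, the feature dimension, and the class count are placeholders.

```python
# Hedged sketch of bidirectional image-text cross-attention with a fused classifier head.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=8):   # num_classes is an assumption
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, img_tok, txt_tok):   # (B, Ni, dim) image tokens, (B, Nt, dim) text tokens
        t, _ = self.txt2img(txt_tok, img_tok, img_tok)   # text queries over image keys/values
        v, _ = self.img2txt(img_tok, txt_tok, txt_tok)   # image queries over text keys/values
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # (B, num_classes) logits

logits = CrossModalBlock()(torch.randn(2, 49, 512), torch.randn(2, 32, 512))
```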
{"title":"SGTformer: improved Shifted Window Transformer network for white blood cell subtype classification","authors":"Xiangyu Deng, Lihao Pan, Zhiyan Dang","doi":"10.1117/1.jei.33.4.043057","DOIUrl":"https://doi.org/10.1117/1.jei.33.4.043057","url":null,"abstract":"White blood cells are a core component of the immune system, responsible for protecting the human body from foreign invaders and infectious diseases. A decrease in the white blood cell count can lead to weakened immune function, increasing the risk of infection and illness. However, determining the number of white blood cells usually requires the expertise and effort of radiologists. In recent years, with the development of image processing technology, biomedical systems have widely applied image processing techniques in disease diagnosis. We aim to classify the subtypes of white blood cells using image processing technology. To improve the ability to extract fine information during the feature extraction process, the spatial prior convolutional attention (SPCA) module is proposed. In addition, to enhance the connection between features at distant distances, the Shifted Window (Swin) Transformer network is used as the backbone for feature extraction. The SGTformer network for white blood cell subtype classification is proposed by combining recursive gate convolution and SPCA modules. Our method is validated on the white blood cell dataset, and the experimental results demonstrate an overall accuracy of 99.47% in white blood cell classification, surpassing existing mainstream classification algorithms. It is evident that this method can effectively accomplish the task of white blood cell classification and provide robust support for the health of the immune system.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"55 1","pages":""},"PeriodicalIF":1.1,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142202438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}