Title: Enhanced human motion detection with hybrid RDA-WOA-based RNN and multiple hypothesis tracking for occlusion handling
Authors: Jeba Nega Cheltha, Chirag Sharma, Deepak Prashar, Arfat Ahmad Khan, Seifedine Kadry
Image and Vision Computing, vol. 150, Article 105234. DOI: 10.1016/j.imavis.2024.105234. Published 2024-08-21.

Abstract: Human motion detection in complex scenarios is challenging because of occlusions. This paper presents an integrated approach to accurate human motion detection that combines Adapted Canny Edge detection as a preprocessing step, a backbone-modified Mask R-CNN for precise segmentation, a Hybrid RDA-WOA-based RNN as the classifier, and a multiple-hypothesis model for effective occlusion handling. Adapted Canny Edge detection is applied first to highlight significant edges in the input image; the resulting edge map enhances object boundaries and structural features, simplifying subsequent processing. The improved image is then passed through the backbone-modified Mask R-CNN for pixel-level segmentation of humans. Together with IoU, Euclidean distance, and Z-score measures, the backbone-modified Mask R-CNN accurately recognizes moving objects in complex scenes. After moving objects are recognized, the optimized Hybrid RDA-WOA-based RNN classifies humans, and Multiple Hypothesis Tracking (MHT) handles self-occlusion. Real-world situations frequently include occlusions in which humans are partially or completely hidden by objects; the proposed approach integrates the multiple-hypothesis model into the detection pipeline to address this challenge. The Hybrid RDA-WOA-based RNN is trained with 2D representations of 3D skeletal motion. The proposed work was evaluated on the IXMAS, KTH, Weizmann, NTU RGB+D, and UCF101 datasets, achieving 98% accuracy on IXMAS, KTH, Weizmann, and UCF101 and 97.1% on NTU RGB+D. The simulation results demonstrate the superiority of the proposed methodology over existing approaches.
Title: Dual temporal memory network with high-order spatio-temporal graph learning for video object segmentation
Authors: Jiaqing Fan, Shenglong Hu, Long Wang, Kaihua Zhang, Bo Liu
Image and Vision Computing, vol. 150, Article 105208. DOI: 10.1016/j.imavis.2024.105208. Published 2024-08-21.

Abstract: Video Object Segmentation (VOS) is typically evaluated in a semi-supervised setting: given only the ground-truth segmentation mask in the initial frame, the goal is to track and segment one or several target objects in the subsequent frames of the sequence. A fundamental issue in VOS is how best to exploit temporal information to improve accuracy. To address this issue, we provide an end-to-end framework that simultaneously extracts long-term and short-term historical sequential information for the current frame as temporal memories. The integrated temporal architecture consists of a short-term and a long-term memory module. Specifically, the short-term memory module leverages a high-order graph-based learning framework to model the fine-grained spatial-temporal interactions between local regions across neighboring frames, thereby maintaining spatio-temporal visual consistency on local regions. Meanwhile, to relieve occlusion and drift, the long-term memory module employs a Simplified Gated Recurrent Unit (S-GRU) to model long-term evolution in a video. Furthermore, we design a novel direction-aware attention module to complementarily augment the object representation for more robust segmentation. Experiments on three mainstream VOS benchmarks (DAVIS 2017, DAVIS 2016, and YouTube-VOS) demonstrate that the proposed solution achieves a favorable trade-off between speed and accuracy.
Title: Reverse cross-refinement network for camouflaged object detection
Authors: Qian Ye, Yaqin Zhou, Guanying Huo, Yan Liu, Yan Zhou, Qingwu Li
Image and Vision Computing, vol. 150, Article 105218. DOI: 10.1016/j.imavis.2024.105218. Published 2024-08-17.

Abstract: Because camouflaged objects are highly similar to their background, their boundaries are often blurred and hard to distinguish. Existing methods still focus on overall regional accuracy rather than boundary quality and struggle to separate camouflaged objects from the background in complex scenarios. We therefore propose a novel reverse cross-refinement network, RCR-Net. Specifically, we design a diverse feature enhancement module that simulates the correspondingly expanded receptive fields of the human visual system by applying convolutional kernels with different dilation rates in parallel, and a boundary attention module that reduces noise in the bottom features. Moreover, a multi-scale feature aggregation module propagates diverse features from pixel-level camouflaged edges to the entire camouflaged object region in a coarse-to-fine manner, combining reverse guidance, group guidance, and position guidance. Reverse guidance mines complementary regions and details by erasing already estimated object regions, while group guidance and position guidance integrate different features through simple and effective splitting and connecting operations. Extensive experiments show that RCR-Net outperforms 18 state-of-the-art methods on four widely used COD datasets. In particular, compared with the top-performing HitNet, RCR-Net improves Mean Absolute Error by about 16.4% on the CAMO dataset, showing that it can accurately detect camouflaged objects.
Title: A streamlined framework for BEV-based 3D object detection with prior masking
Authors: Qinglin Tong, Junjie Zhang, Chenggang Yan, Dan Zeng
Image and Vision Computing, vol. 150, Article 105229. DOI: 10.1016/j.imavis.2024.105229. Published 2024-08-15.

Abstract: In autonomous driving, perception tasks based on the Bird's-Eye-View (BEV) have attracted considerable research attention because of their numerous benefits. Despite recent advances in performance, efficiency remains a challenge for real-world deployment. In this study, we propose an efficient and effective framework that constructs a spatio-temporal BEV feature from multi-camera inputs and uses it for 3D object detection. The success of our network is primarily attributed to the design of the lifting strategy and a tailored BEV encoder. The lifting strategy converts 2D features into 3D representations; since the images carry no depth information, we introduce a prior mask for the BEV feature that assesses the significance of the feature along each camera ray at low cost. Moreover, we design a lightweight BEV encoder that significantly boosts the capacity of this physically interpretable representation; within the encoder, we exploit the spatial relationships of the BEV feature and retain rich residual information from upstream. To further enhance performance, we add a 2D object detection auxiliary head to draw on cues from 2D detection, and leverage 4D information to exploit cues within the sequence. Benefiting from these designs, our network captures rich semantic information from 3D scenes and strikes a balanced trade-off between efficiency and performance.
{"title":"Black-box model adaptation for semantic segmentation","authors":"Zhiheng Zhou , Wanlin Yue , Yinglie Cao , Shifu Shen","doi":"10.1016/j.imavis.2024.105233","DOIUrl":"10.1016/j.imavis.2024.105233","url":null,"abstract":"<div><p>Model adaptation aims to transfer knowledge in pre-trained source models to a new unlabeled dataset. Despite impressive progress, prior methods always need to access the source model and develop data-reconstruction approaches to align the data distributions between target samples and the generated instances, which may raise privacy concerns from source individuals. To alleviate the above problem, we propose a new method in the setting of Black-box model adaptation for semantic segmentation, in which only the pseudo-labels from multiple source domain is required during the adaptation process. Specifically, the proposed method structurally distills the knowledge with multiple classifiers to obtain a customized target model, and then the predictions of target data are refined to fit the target domain with co-regularization. We conduct extensive experiments on several standard datasets, and our method can achieve promising results.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105233"},"PeriodicalIF":4.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale large kernel convolution and hybrid attention network for remote sensing image dehazing","authors":"Hang Su, Lina Liu, Zenghui Wang, Mingliang Gao","doi":"10.1016/j.imavis.2024.105212","DOIUrl":"10.1016/j.imavis.2024.105212","url":null,"abstract":"<div><p>Remote sensing (RS) image dehazing holds significant importance in enhancing the quality and information extraction capability of RS imagery. The enhancement in image dehazing quality has progressively advanced alongside the evolution of convolutional neural network (CNN). Due to the fixed receptive field of CNN, there is insufficient utilization of contextual information on haze features in multi-scale RS images. Additionally, the network fails to adequately extract both local and global information of haze features. In addressing the above problems, in this paper, we propose an RS image dehazing network based on multi-scale large kernel convolution and hybrid attention (MKHANet). The network is mainly composed of multi-scale large kernel convolution (MSLKC) module, hybrid attention (HA) module and feature fusion attention (FFA) module. The MSLKC module fully fuses the multi-scale information of features while enhancing the effective receptive field of the network by parallel multiple large kernel convolutions. To alleviate the problem of uneven distribution of haze and effectively extract the global and local information of haze features, the HA module is introduced by focusing on the importance of haze pixels at the channel level. The FFA module aims to boost the interaction of feature information between the network's deep and shallow layers. The subjective and objective experimental results on on multiple RS hazy image datasets illustrates that MKHANet surpasses existing state-of-the-art (SOTA) approaches. The source code is available at <span><span>https://github.com/tohang98/MKHA_Net</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105212"},"PeriodicalIF":4.2,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Triplet-set feature proximity learning for video anomaly detection","authors":"Kuldeep Marotirao Biradar , Murari Mandal , Sachin Dube , Santosh Kumar Vipparthi , Dinesh Kumar Tyagi","doi":"10.1016/j.imavis.2024.105205","DOIUrl":"10.1016/j.imavis.2024.105205","url":null,"abstract":"<div><p>The identification of anomalies in videos is a particularly complex visual challenge, given the wide variety of potential real-world events. To address this issue, our paper introduces a unique approach for detecting divergent behavior in surveillance videos, utilizing triplet-loss for video anomaly detection. Our method involves selecting a triplet set of video segments from normal (n) and abnormal (a) data points for deep feature learning. We begin by creating a database of triplet sets of two types: a-a-n and n-n-a. By computing a triplet loss, we model the proximity between n-n chunks and the distance between ‘a’ chunks from the n-n ones. Additionally, we train the deep network to model the closeness of a-a chunks and the divergent behavior of ‘n’ from the a-a chunks.</p><p>The model acquired in the initial stage can be viewed as a prior, which is subsequently employed for modeling normality. As a result, our method can leverage the advantages of both straightforward classification and normality modeling-based techniques. We also present a data selection mechanism for the efficient generation of triplet sets. Furthermore, we introduce a novel video anomaly dataset, AnoVIL, designed for human-centric anomaly detection. Our proposed method is assessed using the UCF-Crime dataset encompassing all 13 categories, the IIT-H accident dataset, and AnoVIL. The experimental findings demonstrate that our method surpasses the current state-of-the-art approaches. We conduct further evaluations of the performance, considering various configurations such as cross-dataset evaluation, loss functions, siamese structure, and embedding size. Additionally, an ablation study is carried out across different settings to provide insights into our proposed method.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105205"},"PeriodicalIF":4.2,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141984776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDMNet: Spatially dilated multi-scale network for object detection for drone aerial imagery","authors":"Neeraj Battish , Dapinder Kaur , Moksh Chugh , Shashi Poddar","doi":"10.1016/j.imavis.2024.105232","DOIUrl":"10.1016/j.imavis.2024.105232","url":null,"abstract":"<div><p>Multi-scale object detection is a preeminent challenge in computer vision and image processing. Several deep learning models that are designed to detect various objects miss out on the detection capabilities for small objects, reducing their detection accuracies. Intending to focus on different scales, from extremely small to large-sized objects, this work proposes a Spatially Dilated Multi-Scale Network (SDMNet) architecture for UAV-based ground object detection. It proposes a Multi-scale Enhanced Effective Channel Attention mechanism to preserve the object details in the images. Additionally, the proposed model incorporates dilated convolution, sub-pixel convolution, and additional prediction heads to enhance object detection performance specifically for aerial imaging. It has been evaluated on two popular aerial image datasets, VisDrone 2019 and UAVDT, containing publicly available annotated images of ground objects captured from UAV. Different performance metrics, such as precision, recall, mAP, and detection rate, benchmark the proposed architecture with the existing object detection approaches. The experimental results demonstrate the effectiveness of the proposed model for multi-scale object detection with an average precision score of 54.2% and 98.4% for VisDrone and UAVDT datasets, respectively.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105232"},"PeriodicalIF":4.2,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142044685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: W-shaped network combined with dual transformers and edge protection for multi-focus image fusion
Authors: Hao Zhai, Yun Chen, Yao Wang, Yuncan Ouyang, Zhi Zeng
Image and Vision Computing, vol. 150, Article 105210. DOI: 10.1016/j.imavis.2024.105210. Published 2024-08-13.

Abstract: This paper proposes a W-shaped network combined with dual transformers and edge protection for multi-focus image fusion. Unlike traditional Convolutional Neural Network (CNN) fusion methods, a heterogeneous encoder framework is designed for feature extraction and a decoder is used for feature reconstruction, with the aim of preserving the local details and edge information of the source images as far as possible. Specifically, the first encoder downsamples the source image with adaptive average pooling and extracts its important features, while the second encoder takes as input the source image pair after edge detection with the Gaussian Modified Laplace Operator (GMLO) and downsamples it with adaptive maximum pooling. The encoder part of the network combines CNN and Transformer components to extract both local and global features, and the final fused image is obtained by reconstructing the extracted feature information. To evaluate the method, we compare it against 16 recent multi-focus image fusion methods with both qualitative and quantitative analyses. Experimental results on the public Lytro, MFFW, and MFI-WHU datasets and the real-scene dataset HBU-CVMDSP demonstrate that our method accurately identifies the focused and defocused regions of source images and preserves their edge details while extracting the focused regions.
{"title":"OCUCFormer: An Over-Complete Under-Complete Transformer Network for accelerated MRI reconstruction","authors":"Mohammad Al Fahim , Sriprabha Ramanarayanan , G.S. Rahul , Matcha Naga Gayathri , Arunima Sarkar , Keerthi Ram , Mohanasankar Sivaprakasam","doi":"10.1016/j.imavis.2024.105228","DOIUrl":"10.1016/j.imavis.2024.105228","url":null,"abstract":"<div><p>Many deep learning-based architectures have been proposed for accelerated Magnetic Resonance Imaging (MRI) reconstruction. However, existing encoder-decoder-based popular networks have a few shortcomings: (1) They focus on the anatomy structure at the expense of fine details, hindering their performance in generating faithful reconstructions; (2) Lack of long-range dependencies yields sub-optimal recovery of fine structural details. In this work, we propose an Over-Complete Under-Complete Transformer network (OCUCFormer) which focuses on better capturing fine edges and details in the image and can extract the long-range relations between these features for improved single-coil (SC) and multi-coil (MC) MRI reconstruction. Our model computes long-range relations in the highest resolutions using Restormer modules for improved acquisition and restoration of fine anatomical details. Towards learning in the absence of fully sampled ground truth for supervision, we show that our model trained with under-sampled data in a self-supervised fashion shows a superior recovery of fine structures compared to other works. We have extensively evaluated our network for SC and MC MRI reconstruction on brain, cardiac, and knee anatomies for <span><math><mn>4</mn><mo>×</mo></math></span> and <span><math><mn>5</mn><mo>×</mo></math></span> acceleration factors. We report significant improvements over popular deep learning-based methods when trained in supervised and self-supervised modes. We have also performed experiments demonstrating the strengths of extracting fine details and the anatomical structure and computing long-range relations within over-complete representations. Code for our proposed method is available at: <span><span><span>https://github.com/alfahimmohammad/OCUCFormer-main</span></span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"150 ","pages":"Article 105228"},"PeriodicalIF":4.2,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141997841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}