Title: Enhancing few-shot object detection through pseudo-label mining
Authors: Pablo Garcia-Fernandez, Daniel Cores, Manuel Mucientes
Journal: Image and Vision Computing, Volume 154, Article 105379, February 2025
DOI: 10.1016/j.imavis.2024.105379
Abstract: Few-shot object detection involves adapting an existing detector to a set of unseen categories with few annotated examples. This data limitation causes these methods to underperform those trained on large labeled datasets. In many scenarios there is a large amount of unlabeled data that is never exploited. We therefore propose to exPAND the initial novel set by mining pseudo-labels. From a raw set of detections, xPAND obtains reliable pseudo-labels suitable for training any detector. To this end, we propose two new modules: Class Confirmation and Box Confirmation. Class Confirmation aims to remove misclassified pseudo-labels by comparing candidates with expected class prototypes. Box Confirmation estimates IoU to discard inadequately framed objects. Experimental results demonstrate that xPAND improves the performance of multiple detectors by up to +5.9 nAP and +16.4 nAP50 points on MS-COCO and PASCAL VOC, respectively, establishing a new state of the art. Code: https://github.com/PAGF188/xPAND.
Title: Skeleton action recognition via group sparsity constrained variant graph auto-encoder
Authors: Hongjuan Pei, Jiaying Chen, Shihao Gao, Taisong Jin, Ke Lu
Journal: Image and Vision Computing, Volume 154, Article 105426, February 2025
DOI: 10.1016/j.imavis.2025.105426
Abstract: Human skeleton action recognition has garnered significant attention from researchers due to its promising performance in real-world applications. Recently, graph neural networks (GNNs) have been applied to this field, with graph convolutional networks (GCNs) commonly used to model the spatial configuration and temporal dynamics of joints. However, the GCN-based paradigm for skeleton action recognition fails to recognize and disentangle the heterogeneous factors of the action representation. Consequently, the learned action features are susceptible to irrelevant factors, hindering further performance gains. To address this issue and learn a disentangled action representation, we propose a novel skeleton action recognition method, termed β-bVGAE. The proposed method leverages a group sparsity constrained variant graph auto-encoder, rather than graph convolutional networks, to learn discriminative features of the skeleton sequence. Extensive experiments on benchmark action recognition datasets demonstrate that our method outperforms existing GCN-based skeleton action recognition methods, highlighting the significant potential of the variant auto-encoder architecture for skeleton action recognition.
Title: Feature extraction and fusion algorithm for infrared visible light images based on residual and generative adversarial network
Authors: Naigong Yu, YiFan Fu, QiuSheng Xie, QiMing Cheng, Mohammad Mehedi Hasan
Journal: Image and Vision Computing, Volume 154, Article 105346, February 2025
DOI: 10.1016/j.imavis.2024.105346
Abstract: With the spread of depth cameras, image fusion techniques based on infrared and visible light are increasingly used in various fields. Object detection and robot navigation impose stringent requirements on the texture detail and quality of fused images. Existing residual networks, attention mechanisms, and generative adversarial networks handle the fusion of infrared and visible light images poorly because they extract detail features insufficiently and do not conform to the human visual perception system. Our newly developed RGFusion network combines a two-channel attention mechanism, a residual network, and a generative adversarial network, and introduces two new components: a high-precision image feature extractor and an efficient multi-stage training strategy. Inputs are preprocessed by a high-dimensional mapping, and the feature extractor drives a two-stage fusion process that produces feature structures with multiple features, resulting in high-quality fused images rich in detail. Extensive experiments on public datasets validate this fusion approach: RGFusion leads on the EN, SF, and SD metrics, reaching 7.366, 13.322, and 49.281 on the TNO dataset and 7.276, 19.171, and 53.777 on the RoadScene dataset, respectively.
{"title":"EDCAANet: A lightweight COD network based on edge detection and coordinate attention assistance","authors":"Qing Pan, Xiayuan Feng, Nili Tian","doi":"10.1016/j.imavis.2024.105382","DOIUrl":"10.1016/j.imavis.2024.105382","url":null,"abstract":"<div><div>In order to obtain the higher efficiency and the more accuracy in camouflaged object detection (COD), a lightweight COD network based on edge detection and coordinate attention assistance (EDCAANet) is presented in this paper. Firstly, an Integrated Edge and Global Context Information Module (IEGC) is proposed, which uses edge detection as an auxiliary means to collaborate with the atrous spatial convolution pooling pyramid (ASPP) for obtaining global context information to achieve the preliminary positioning of the camouflaged object. Then, the Receptive Field Module based on Coordinate Attention (RFMC) is put forward, in which the Coordinate Attention (CA) mechanism is employed as another aid means to expand receptive ffeld features and then achieve global comprehensive of the image. In the final stage of feature fusion, the proposed lightweight Adjacent and Global Context Focusing module (AGCF) is employed to aggregate the multi-scale semantic features output by RFMC at adjacent levels and the global context features output by IEGC. These aggregated features are mainly refined by the proposed Multi Scale Convolutional Aggregation (MSDA) blocks in the module, allowing features to interact and combine at various scales to ultimately produce prediction results. The experiments include performance comparison experiment, testing in complex background, generalization experiment, as well as ablation experiment and complexity analysis. Four public datasets are adopted for experiments, four recognized COD metrics are employed for performance evaluation, 3 backbone networks and 18 methods are used for comparison. The experimental results show that the proposed method can obtain both the more excellent detection performance and the higher efficiency.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105382"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: De-noising mask transformer for referring image segmentation
Authors: Yehui Wang, Fang Lei, Baoyan Wang, Qiang Zhang, Xiantong Zhen, Lei Zhang
Journal: Image and Vision Computing, Volume 154, Article 105356, February 2025
DOI: 10.1016/j.imavis.2024.105356
Abstract: Referring Image Segmentation (RIS) is a challenging computer vision task that involves identifying and segmenting specific objects in an image based on a natural language description. Unlike conventional segmentation, RIS needs to bridge the gap between the visual and linguistic modalities in order to exploit the semantic information provided by natural language. Most existing RIS approaches share a common issue: the intermediate predicted target region also participates in later feature generation and parameter updates, so wrong predictions, which occur especially in the early training stage, mislead the gradients and ultimately harm training stability. To tackle this issue, we propose the de-noising mask (DNM) transformer, a novel framework for cross-modal fusion that replaces the cross-attention of the traditional transformer with DNM-attention. Two kinds of DNM-attention, named mask-DNM and cluster-DNM, are proposed, in which noisy ground-truth information guides the attention mechanism to produce accurate object queries, i.e., de-noising queries. DNM-attention thus leverages noisy ground-truth information to produce additional de-noising queries, which effectively avoids gradient misleading. Experimental results show that the DNM transformer improves RIS performance and outperforms most existing RIS approaches on three benchmarks.
Title: CLBSR: A deep curriculum learning-based blind image super resolution network using geometrical prior
Authors: Alireza Esmaeilzehi, Amir Mohammad Babaei, Farshid Nooshi, Hossein Zaredar, M. Omair Ahmad
Journal: Image and Vision Computing, Volume 154, Article 105364, February 2025
DOI: 10.1016/j.imavis.2024.105364
Abstract: Blind image super resolution (SR) is a challenging computer vision task that involves enhancing the quality of low-resolution (LR) images produced by various degradation operations. Deep neural networks provide state-of-the-art performance for blind image SR, and the literature shows that decoupling the task into blurring-kernel estimation and high-quality image reconstruction yields superior performance. In this paper, we first propose a novel optimization problem that, by using geometrical information as a prior, estimates blurring kernels accurately. We then propose a novel blind image SR network that employs the estimated blurring kernel in its architecture and learning algorithm in order to generate high-quality images. In this regard, we adopt a curriculum learning strategy, in which training of the SR network is initially facilitated by the ground-truth (GT) blurring kernel and then continued with the blurring kernel estimated by our optimization problem. Various experiments show the effectiveness of the proposed blind image SR scheme in comparison with state-of-the-art methods across degradation operations and benchmark datasets.
Title: Class-discriminative domain generalization for semantic segmentation
Authors: Muxin Liao, Shishun Tian, Yuhang Zhang, Guoguang Hua, Rong You, Wenbin Zou, Xia Li
Journal: Image and Vision Computing, Volume 154, Article 105393, February 2025
DOI: 10.1016/j.imavis.2024.105393
Abstract: Existing domain generalization methods for semantic segmentation aim to improve generalization by learning domain-invariant information so that models generalize well to unseen domains. However, these methods ignore the class discriminability of the model, which may lead to class confusion. In this paper, a class-discriminative domain generalization (CDDG) approach is proposed to alleviate both the distribution shift and class confusion in semantic segmentation. Specifically, a dual prototypical contrastive learning module is proposed. Since the high-frequency component is consistent across domains, a class-text-guided high-frequency prototypical contrastive learning is proposed: it uses text embeddings as prior knowledge to guide the learning of high-frequency prototypical representations from high-frequency components, mining domain-invariant information and further improving generalization. However, domain-specific information may also contain label-related information that helps discriminate specific classes, so learning only domain-invariant information may limit the class discriminability of the model. To address this, a low-frequency prototypical contrastive learning is proposed to learn class-discriminative representations from low-frequency components, which are more domain-specific. Finally, the class-discriminative and high-frequency prototypical representations are fused to improve both the generalization ability and the class discriminability of the model. Extensive experiments demonstrate that the proposed approach outperforms current methods on single-source and multi-source domain generalization benchmarks.
{"title":"Efficient and robust multi-camera 3D object detection in bird-eye-view","authors":"Yuanlong Wang, Hengtao Jiang, Guanying Chen, Tong Zhang, Jiaqing Zhou, Zezheng Qing, Chunyan Wang, Wanzhong Zhao","doi":"10.1016/j.imavis.2025.105428","DOIUrl":"10.1016/j.imavis.2025.105428","url":null,"abstract":"<div><div>Bird's-eye view (BEV) representations are increasingly used in autonomous driving perception due to their comprehensive, unobstructed vehicle surroundings. Compared to transformer or depth based methods, ray transformation based methods are more suitable for vehicle deployment and more efficient. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray transformation based methods and propose an extrinsic parameters free approach, which reduces reliance on accurate offline camera extrinsic calibration by using a neural network to predict extrinsic parameters online and can effectively improve the robustness of the model. In addition, we propose a multi-level and multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework further mainly contains four important designs: (1) a multi-level and multi-scale image encoder, which can leverage multi-scale information on the inter-layer and the intra-layer for better performance, (2) ray-transformation with extrinsic parameters free approach, which can transfers image features to BEV space and lighten the impact of extrinsic disturbance on m-odel's detection performance, (3) an intensive temporal fusion strategy using motion information from five historical frames. (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuse the multi-scale and the multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 × 1600) realized competitive 41.7% mAP and 53.8% NDS on the validation set, which outperforming several state-of-the-art visual BEV models in 3D object detection.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105428"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Advancing brain tumor segmentation and grading through integration of FusionNet and IBCO-based ALCResNet
Authors: Abbas Rehman, Gu Naijie, Asma Aldrees, Muhammad Umer, Abeer Hakeem, Shtwai Alsubai, Lucia Cascone
Journal: Image and Vision Computing, Volume 154, Article 105432, February 2025
DOI: 10.1016/j.imavis.2025.105432
Abstract: Brain tumors represent a significant global health challenge, characterized by uncontrolled cerebral cell growth. The variability in size, shape, and anatomical positioning complicates computational classification, which is crucial for effective treatment planning. Accurate detection is essential, as even small diagnostic inaccuracies can significantly increase the mortality risk. Tumor grade stratification is also critical for automated diagnosis; however, current deep learning models often fall short of the desired effectiveness. In this study, we propose an advanced approach that leverages deep learning to improve early detection and tumor severity grading, facilitating automated diagnosis. Clinical bioinformatics datasets are used to source representative brain tumor images, which undergo pre-processing and data augmentation via a Generative Adversarial Network (GAN). The images are then classified using the Adaptive Layer Cascaded ResNet (ALCResNet) model, optimized with the Improved Border Collie Optimization (IBCO) algorithm for enhanced diagnostic accuracy. The integration of FusionNet for precise segmentation and the IBCO-enhanced ALCResNet for optimized feature extraction and classification forms a novel framework. This combination ensures not only accurate segmentation but also enhanced precision in grading tumor severity, addressing key limitations of existing methodologies. For segmentation, the FusionNet deep learning model identifies abnormal regions, which are subsequently classified as Meningioma, Glioma, or Pituitary tumors using ALCResNet. Experimental results demonstrate significant improvements in tumor identification and severity grading, with the proposed method achieving superior precision (99.79%) and accuracy (99.33%) compared to existing classifiers and heuristic approaches.
Title: EMA-GS: Improving sparse point cloud rendering with EMA gradient and anchor upsampling
Authors: Ding Yuan, Sizhe Zhang, Hong Zhang, Yangyan Deng, Yifan Yang
Journal: Image and Vision Computing, Volume 154, Article 105433, February 2025
DOI: 10.1016/j.imavis.2025.105433
Abstract: The 3D Gaussian Splatting (3D-GS) technique combines 3D Gaussian primitives with differentiable rasterization for real-time, high-quality novel view synthesis. However, in sparse regions of the initial point cloud it often produces blurring and needle-like artifacts because of the inadequacies of the existing densification criterion. To address this, we introduce an approach that uses the exponential moving average (EMA) of homodirectional positional gradients as the densification criterion. In addition, in the early stages of training, anchors are upsampled near representative locations to fill details into the sparse initial point cloud. Tests on challenging datasets such as Mip-NeRF 360, Tanks and Temples, and Deep Blending demonstrate that the proposed method recovers fine detail without redundant Gaussians, handling complex scenes with high-quality reconstruction and without requiring excessive storage. The code will be available upon acceptance of the article.