Computer Vision and Image Understanding — Latest Articles

Cascading attention enhancement network for RGB-D indoor scene segmentation
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-06-06 | DOI: 10.1016/j.cviu.2025.104411
Xu Tang, Songyang Cen, Zhanhao Deng, Zejun Zhang, Yan Meng, Jianxiao Xie, Changbing Tang, Weichuan Zhang, Guanghui Zhao
{"title":"Cascading attention enhancement network for RGB-D indoor scene segmentation","authors":"Xu Tang ,&nbsp;Songyang Cen ,&nbsp;Zhanhao Deng ,&nbsp;Zejun Zhang ,&nbsp;Yan Meng ,&nbsp;Jianxiao Xie ,&nbsp;Changbing Tang ,&nbsp;Weichuan Zhang ,&nbsp;Guanghui Zhao","doi":"10.1016/j.cviu.2025.104411","DOIUrl":"10.1016/j.cviu.2025.104411","url":null,"abstract":"<div><div>Convolutional neural network based Red, Green, Blue, and Depth (RGB-D) image semantic segmentation for indoor scenes has attracted increasing attention, because of its great potentiality of extracting semantic information from RGB-D images. However, the challenge it brings lies in how to effectively fuse features from RGB and depth images within the neural network architecture. The technical approach of feature aggregation has evolved from the early integration of RGB color images and depth images to the current cross-attention fusion, which enables the features of different RGB channels to be fully integrated with ones of the depth image. However, noises and useless feature for segmentation are inevitably propagated between feature layers during the period of feature aggregation, thereby affecting the accuracy of segmentation results. In this paper, for indoor scenes, a cascading attention enhancement network (CAENet) is proposed with the aim of progressively refining the semantic features of RGB and depth images layer by layer, consisting of four modules: a channel enhancement module (CEM), an adaptive aggregation of spatial attention (AASA), an adaptive aggregation of channel attention (AACA), and a triple-path fusion module (TFM). In encoding stage, CEM complements RGB features with depth features at the end of each layer, in order to effectively revise RGB features for the next layer. At the end of encoding stage, AASA module combines low-level and high-level RGB semantic features by their spatial attention, and AACA module fuses low-level and high-level depth semantic features by their channel attention. The combined RGB and depth semantic features are fused into one and fed into the decoding stage, which consists of triple-path fusion modules (TFMs) combining low-level RGB and depth semantic features and decoded high-level semantic features. The TFM outputs multi-scale feature maps that encapsulate both rich semantic information and fine-grained details, thereby augmenting the model’s capacity for accurate per-pixel semantic label prediction. The proposed CAENet achieves mIoU of 52.0% on NYUDv2 and 48.3% on SUNRGB-D datasets, outperforming recent RGB-D segmentation methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104411"},"PeriodicalIF":4.3,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144270155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
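To make the depth-to-RGB complementing idea behind the CEM concrete, here is a minimal, hypothetical PyTorch sketch of one cross-modal enhancement step: depth features gate the RGB features through channel attention before being added back. Module and tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelEnhancement(nn.Module):
    """Illustrative sketch: depth features re-weight RGB channels (not the official CEM)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # squeeze spatial dimensions
        self.mlp = nn.Sequential(                          # channel-attention MLP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        # attention weights derived from the depth stream
        w = self.mlp(self.pool(depth_feat))                # (B, C, 1, 1)
        # revise RGB features with depth-guided channel weights, keeping a residual path
        return rgb_feat + rgb_feat * w

# usage: enhance 64-channel features at one encoder stage
cem = ChannelEnhancement(64)
rgb = torch.randn(2, 64, 120, 160)
depth = torch.randn(2, 64, 120, 160)
out = cem(rgb, depth)   # same shape as rgb
```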
HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-06-05 | DOI: 10.1016/j.cviu.2025.104409
Hongcheng Xue, Zhan Tang, Yuantian Xia, Longhe Wang, Lin Li
{"title":"HCTD: A CNN-transformer hybrid for precise object detection in UAV aerial imagery","authors":"Hongcheng Xue ,&nbsp;Zhan Tang ,&nbsp;Yuantian Xia ,&nbsp;Longhe Wang ,&nbsp;Lin Li","doi":"10.1016/j.cviu.2025.104409","DOIUrl":"10.1016/j.cviu.2025.104409","url":null,"abstract":"<div><div>Object detection in UAV imagery poses substantial challenges due to severe object scale variation, dense distributions of small objects, complex backgrounds, and arbitrary orientations. These factors, compounded by high inter-class similarity and large intra-class variation caused by multi-scale targets, occlusion, and environmental interference, make aerial object detection fundamentally different from conventional scenes. Existing methods often struggle to capture global semantic information effectively and tend to overlook critical issues such as feature loss during downsampling, information redundancy, and inconsistency in cross-level feature interactions. To address these limitations, this paper proposes a hybrid CNN-Transformer-based detector, termed HCTD, specifically designed for UAV image analysis. The proposed framework integrates three novel modules: (1) a Feature Filtering Module (FFM) that enhances discriminative responses and suppresses background noise through dual global pooling (max and average) strategies; (2) a Convolutional Additive Self-attention Feature Interaction (CASFI) module that replaces dot-product attention with a lightweight additive fusion of spatial and channel interactions, enabling efficient global context modeling at reduced computational cost; and (3) a Global Context Flow Feature Pyramid Network (GC2FPN) that facilitates multi-scale semantic propagation and alignment to improve small-object detection robustness. Extensive experiments on the VisDrone2019 dataset demonstrate that HCTD-R18 and HCTD-R50 achieve 38.2%/43.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>50</mn></mrow></msub></math></span>, 23.1%/24.6% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mn>75</mn></mrow></msub></math></span>, and 13.9%/14.7% <span><math><msub><mrow><mi>AP</mi></mrow><mrow><mi>S</mi></mrow></msub></math></span> respectively. Additionally, the TIDE toolkit is employed to analyze the absolute and relative contributions of six error types, providing deeper insight into the effectiveness of each module and offering valuable guidance for future improvements. The code is available at: <span><span>https://github.com/Mundane-X/HCTD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104409"},"PeriodicalIF":4.3,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144263064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
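As a rough illustration of the dual-pooling idea the FFM is described with, the following hypothetical PyTorch snippet combines global max and average pooling into a per-channel gate that suppresses low-response, background-dominated channels; the names and structure are assumptions, not the released HCTD code.

```python
import torch
import torch.nn as nn

class DualPoolGate(nn.Module):
    """Sketch of a feature filter driven by global max + average pooling (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)      # global average pooling
        mx = torch.amax(x, dim=(2, 3), keepdim=True)        # global max pooling
        gate = torch.sigmoid(self.fc(avg + mx))             # per-channel gate in (0, 1)
        return x * gate                                      # damp weakly responding channels

x = torch.randn(2, 256, 64, 64)
print(DualPoolGate(256)(x).shape)   # torch.Size([2, 256, 64, 64])
```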
Spatio-temporal graph neural network based child action recognition using data-efficient methods: A systematic analysis
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-06-03 | DOI: 10.1016/j.cviu.2025.104410
Sanka Mohottala, Asiri Gawesha, Dharshana Kasthurirathna, Pradeepa Samarasinghe, Charith Abhayaratne
{"title":"Spatio-temporal graph neural network based child action recognition using data-efficient methods: A systematic analysis","authors":"Sanka Mohottala ,&nbsp;Asiri Gawesha ,&nbsp;Dharshana Kasthurirathna ,&nbsp;Pradeepa Samarasinghe ,&nbsp;Charith Abhayaratne","doi":"10.1016/j.cviu.2025.104410","DOIUrl":"10.1016/j.cviu.2025.104410","url":null,"abstract":"<div><div>This paper presents implementations on child activity recognition (CAR) using spatial–temporal graph neural network (ST-GNN)-based deep learning models with the skeleton modality. Prior implementations in this domain have predominantly utilized CNN, LSTM, and other methods, despite the superior performance potential of graph neural networks. To the best of our knowledge, this study is the first to use an ST-GNN model for child activity recognition employing both in-the-lab, in-the-wild, and in-the-deployment skeleton data. To overcome the challenges posed by small publicly available child action datasets, transfer learning methods such as feature extraction and fine-tuning were applied to enhance model performance.</div><div>As a principal contribution, we developed an ST-GNN-based skeleton modality model that, despite using a relatively small child action dataset, achieved superior performance (94.81%) compared to implementations trained on a significantly larger (x10) adult action dataset (90.6%) for a similar subset of actions. With ST-GCN-based feature extraction and fine-tuning methods, accuracy improved by 10%–40% compared to vanilla implementations, achieving a maximum accuracy of 94.81%. Additionally, implementations with other ST-GNN models demonstrated further accuracy improvements of 15%–45% over the ST-GCN baseline.</div><div>The results on activity datasets empirically demonstrate that class diversity, dataset size, and careful selection of pre-training datasets significantly enhance accuracy. In-the-wild and in-the-deployment implementations confirm the real-world applicability of above approaches, with the ST-GNN model achieving 11 FPS on streaming data. Finally, preliminary evidence on the impact of graph expressivity and graph rewiring on accuracy of small dataset-based models is provided, outlining potential directions for future research. The codes are available at <span><span>https://github.com/sankamohotttala/ST_GNN_HAR_DEML</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104410"},"PeriodicalIF":4.3,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
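The feature-extraction versus fine-tuning distinction mentioned above can be sketched generically in PyTorch: freeze a pre-trained backbone and train only a new classification head (feature extraction), or unfreeze everything at a lower learning rate (fine-tuning). The backbone below is a stand-in module with made-up sizes, not the authors' ST-GCN.

```python
import torch
import torch.nn as nn

# stand-in for a pre-trained skeleton backbone (e.g., an ST-GCN feature extractor)
backbone = nn.Sequential(nn.Linear(75, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 10)   # new head for 10 hypothetical child-action classes

# --- feature extraction: freeze the backbone, train only the new head ---
for p in backbone.parameters():
    p.requires_grad = False
opt_fe = torch.optim.Adam(head.parameters(), lr=1e-3)

# --- fine-tuning: unfreeze everything, give the backbone a smaller learning rate ---
for p in backbone.parameters():
    p.requires_grad = True
opt_ft = torch.optim.Adam(
    [{"params": backbone.parameters(), "lr": 1e-4},
     {"params": head.parameters(), "lr": 1e-3}]
)

x = torch.randn(8, 75)        # 25 joints x 3 coordinates, flattened (toy input)
logits = head(backbone(x))    # forward pass shared by both training regimes
print(logits.shape)           # torch.Size([8, 10])
```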
CSGN: CLIP-driven semantic guidance network for Clothes-Changing Person Re-Identification
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-06-02 | DOI: 10.1016/j.cviu.2025.104406
Yang Lu, Bin Ge, Chenxing Xia, Junming Guan
{"title":"CSGN:CLIP-driven semantic guidance network for Clothes-Changing Person Re-Identification","authors":"Yang Lu ,&nbsp;Bin Ge ,&nbsp;Chenxing Xia ,&nbsp;Junming Guan","doi":"10.1016/j.cviu.2025.104406","DOIUrl":"10.1016/j.cviu.2025.104406","url":null,"abstract":"<div><div>Clothes-Changing Person Re-identification (CCReID) aims to match identities across images of individuals in different attires. Due to the significant appearance variations caused by clothing changes, distinguishing the same identity becomes challenging, while the differences between distinct individuals are often subtle. To address this, we reduce the impact of clothing information on identity judgment by introducing linguistic modalities. Considering CLIP’s (Contrastive Language-Image Pre-training) ability to align high-level semantic information with visual features, we propose a CLIP-driven Semantic Guidance Network (CSGN), which consists of a Multi-Description Generator (MDG), a Visual Semantic Steering module (VSS), and a Heterogeneous Semantic Fusion loss (HSF). Specifically, to mitigate the color sensitivity of CLIP’s text encoder, we design the MDG to generate pseudo-text in both RGB and grayscale modalities, incorporating a combined loss function for text-image mutuality. This helps reduce the encoder’s bias towards color. Additionally, to improve the CLIP visual encoder’s ability to extract identity-independent features, we construct the VSS, which combines ResNet and ViT feature extractors to enhance visual feature extraction. Finally, recognizing the complementary nature of semantics in heterogeneous descriptions, we use HSF, which constrains visual features by focusing not only on pseudo-text derived from RGB but also on pseudo-text derived from grayscale, thereby mitigating the influence of clothing information. Experimental results show that our method outperforms existing state-of-the-art approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104406"},"PeriodicalIF":4.3,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144253992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
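A toy PyTorch sketch of the dual-pseudo-text constraint described for the HSF loss: visual embeddings are pulled toward text embeddings derived from both RGB and grayscale views. All tensors and the weighting are hypothetical stand-ins; no real CLIP encoders are loaded here, and this is not the paper's loss definition.

```python
import torch
import torch.nn.functional as F

def hsf_style_loss(img_emb, text_emb_rgb, text_emb_gray, alpha=0.5):
    """Toy heterogeneous-fusion loss: align image features with two pseudo-text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    t_rgb = F.normalize(text_emb_rgb, dim=-1)
    t_gray = F.normalize(text_emb_gray, dim=-1)
    # 1 - cosine similarity, averaged over the batch, for each text modality
    loss_rgb = (1 - (img * t_rgb).sum(dim=-1)).mean()
    loss_gray = (1 - (img * t_gray).sum(dim=-1)).mean()
    return alpha * loss_rgb + (1 - alpha) * loss_gray

img_emb = torch.randn(8, 512)      # stand-in visual embeddings
text_rgb = torch.randn(8, 512)     # pseudo-text embeddings from RGB descriptions
text_gray = torch.randn(8, 512)    # pseudo-text embeddings from grayscale descriptions
print(hsf_style_loss(img_emb, text_rgb, text_gray))
```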
Continuous conditional video synthesis by neural processes
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-29 | DOI: 10.1016/j.cviu.2025.104387
Xi Ye, Guillaume-Alexandre Bilodeau
{"title":"Continuous conditional video synthesis by neural processes","authors":"Xi Ye,&nbsp;Guillaume-Alexandre Bilodeau","doi":"10.1016/j.cviu.2025.104387","DOIUrl":"10.1016/j.cviu.2025.104387","url":null,"abstract":"<div><div>Different conditional video synthesis tasks, such as frame interpolation and future frame prediction, are typically addressed individually by task-specific models, despite their shared underlying characteristics. Additionally, most conditional video synthesis models are limited to discrete frame generation at specific integer time steps. This paper presents a unified model that tackles both challenges simultaneously. We demonstrate that conditional video synthesis can be formulated as a neural process, where input spatio-temporal coordinates are mapped to target pixel values by conditioning on context spatio-temporal coordinates and pixel values. Our approach leverages a Transformer-based non-autoregressive conditional video synthesis model that takes the implicit neural representation of coordinates and context pixel features as input. Our task-specific models outperform previous methods for future frame prediction and frame interpolation across multiple datasets. Importantly, our model enables temporal continuous video synthesis at arbitrary high frame rates, outperforming the previous state-of-the-art. The source code and video demos for our model are available at <span><span>https://npvp.github.io</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104387"},"PeriodicalIF":4.3,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
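To illustrate the coordinate-conditioned formulation, here is a small, assumption-laden sketch of a Fourier-feature encoding for (x, y, t) query coordinates, the kind of implicit neural representation such a model could consume; the function and its parameters are illustrative, not the paper's actual encoder.

```python
import torch

def fourier_encode(coords, num_bands=6):
    """Encode (x, y, t) coordinates in [0, 1] with sin/cos features at multiple frequencies."""
    # coords: (N, 3) -> (N, 3 * 2 * num_bands + 3)
    freqs = 2.0 ** torch.arange(num_bands, dtype=coords.dtype)        # (num_bands,)
    scaled = coords[..., None] * freqs * torch.pi                      # (N, 3, num_bands)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)    # (N, 3, 2*num_bands)
    return torch.cat([coords, enc.flatten(start_dim=-2)], dim=-1)

# query a small grid of pixels at a fractional (non-integer) time step
xy = torch.stack(torch.meshgrid(torch.linspace(0, 1, 4),
                                torch.linspace(0, 1, 4), indexing="ij"), dim=-1).reshape(-1, 2)
t = torch.full((xy.shape[0], 1), 0.625)     # fractional time: the "continuous" part
queries = fourier_encode(torch.cat([xy, t], dim=1))
print(queries.shape)                         # torch.Size([16, 39])
```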
Efficient human-object-interaction (EHOI) detection via interaction label coding and Conditional Decision
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-22 | DOI: 10.1016/j.cviu.2025.104390
Tsung-Shan Yang, Yun-Cheng Wang, Chengwei Wei, Suya You, C.-C. Jay Kuo
{"title":"Efficient human–object-interaction (EHOI) detection via interaction label coding and Conditional Decision","authors":"Tsung-Shan Yang ,&nbsp;Yun-Cheng Wang ,&nbsp;Chengwei Wei ,&nbsp;Suya You ,&nbsp;C.-C. Jay Kuo","doi":"10.1016/j.cviu.2025.104390","DOIUrl":"10.1016/j.cviu.2025.104390","url":null,"abstract":"<div><div>Human–Object Interaction (HOI) detection is a fundamental task in image understanding. While deep-learning-based HOI methods provide high performance in terms of mean Average Precision (mAP), they are computationally expensive and opaque in training and inference processes. An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency. EHOI is a two-stage method. In the first stage, it leverages a frozen object detector to localize the objects and extract various features as intermediate outputs. In the second stage, the first-stage outputs predict the interaction type using the XGBoost classifier. Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases, which reduces the model size and the complexity of the XGBoost classifier in the second stage. Additionally, we provide a mathematical formulation of the relabeling and decision-making process. Apart from the architecture, we present qualitative results to explain the functionalities of the feedforward modules. Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method. The codes are available: <span><span>https://github.com/keevin60907/EHOI---Efficient-Human-Object-Interaction-Detector</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104390"},"PeriodicalIF":4.3,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144115541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
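In the same spirit as ECC-coded interaction labels, the sketch below combines XGBoost with scikit-learn's error-correcting output codes wrapper, which assigns each class a redundant binary codeword and trains one booster per code bit. The features, class count, and code size are hypothetical; this mirrors the general technique, not the paper's specific coding scheme.

```python
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from xgboost import XGBClassifier

# toy stand-in features, e.g., pooled human/object appearance + spatial layout vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))            # 600 human-object pairs, 32-d features
y = rng.integers(0, 12, size=600)         # 12 hypothetical interaction classes

# Error-correcting output codes: code_size > 1 adds redundant bits,
# which gives the codeword decoding some error-correcting slack.
ecc_xgb = OutputCodeClassifier(
    XGBClassifier(n_estimators=50, max_depth=3, verbosity=0),
    code_size=1.5,
    random_state=0,
)
ecc_xgb.fit(X[:500], y[:500])
print("held-out accuracy:", ecc_xgb.score(X[500:], y[500:]))
```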
HVQ-VAE: Variational auto-encoder with hyperbolic vector quantization
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-21 | DOI: 10.1016/j.cviu.2025.104392
Shangyu Chen, Pengfei Fang, Mehrtash Harandi, Trung Le, Jianfei Cai, Dinh Phung
{"title":"HVQ-VAE: Variational auto-encoder with hyperbolic vector quantization","authors":"Shangyu Chen ,&nbsp;Pengfei Fang ,&nbsp;Mehrtash Harandi ,&nbsp;Trung Le ,&nbsp;Jianfei Cai ,&nbsp;Dinh Phung","doi":"10.1016/j.cviu.2025.104392","DOIUrl":"10.1016/j.cviu.2025.104392","url":null,"abstract":"<div><div>Vector quantized-variational autoencoder (VQ-VAE) and its variants have made significant progress in creating discrete latent space via learning a codebook. Previous works on VQ-VAE have focused on discrete latent spaces in Euclidean or in spherical spaces. This paper studies the geometric prior of hyperbolic spaces as a way to improve the learning capacity of VQ-VAE. That being said, working with the VQ-VAE in the hyperbolic space is not without difficulties, and the benefits of using hyperbolic space as the geometric prior for the latent space have never been studied in VQ-VAE. We bridge this gap by developing the VQ-VAE with hyperbolic vector quantization. To this end, we propose the hyperbolic VQ-VAE (HVQ-VAE), which learns the latent embedding of data and the codebook in the hyperbolic space. Specifically, we endow the discrete latent space in the Poincaré ball, such that the clustering algorithm can be formulated and optimized in the Poincaré ball. Thorough experiments against various baselines are conducted to evaluate the superiority of the proposed HVQ-VAE empirically. We show that HVQ-VAE enjoys better image reconstruction, effective codebook usage, and fast convergence than baselines. We also present evidence that HVQ-VAE outperforms VQ-VAE in low-dimensional latent space.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104392"},"PeriodicalIF":4.3,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144134661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
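For intuition about quantization in the Poincaré ball, the sketch below assigns encoder outputs to their nearest codebook entries under the standard Poincaré distance d(u, v) = arccosh(1 + 2·||u − v||² / ((1 − ||u||²)(1 − ||v||²))). It is a standalone illustration of that distance-based assignment step, not the authors' training code.

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """Pairwise Poincare-ball distance between rows of u (N, D) and v (M, D), norms < 1."""
    sq = torch.cdist(u, v) ** 2                              # ||u - v||^2, shape (N, M)
    u_term = 1.0 - (u ** 2).sum(dim=1, keepdim=True)         # 1 - ||u||^2, shape (N, 1)
    v_term = 1.0 - (v ** 2).sum(dim=1, keepdim=True).T       # 1 - ||v||^2, shape (1, M)
    x = 1.0 + 2.0 * sq / (u_term * v_term).clamp_min(eps)
    return torch.acosh(x.clamp_min(1.0 + eps))

# toy latents and codebook, scaled to lie strictly inside the unit ball
z = torch.randn(16, 8)
z = 0.5 * z / (1 + z.norm(dim=1, keepdim=True))
codebook = torch.randn(32, 8)
codebook = 0.5 * codebook / (1 + codebook.norm(dim=1, keepdim=True))

codes = poincare_distance(z, codebook).argmin(dim=1)   # nearest codeword per latent
quantized = codebook[codes]                            # hyperbolic vector quantization step
print(codes.shape, quantized.shape)                    # torch.Size([16]) torch.Size([16, 8])
```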
Uniss-MDF: A Multidimensional Face dataset for assessing face analysis on the move
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-19 | DOI: 10.1016/j.cviu.2025.104384
Pietro Ruiu, Marinella Iole Cadoni, Andrea Lagorio, Seth Nixon, Filippo Casu, Massimo Farina, Mauro Fadda, Giuseppe A. Trunfio, Massimo Tistarelli, Enrico Grosso
{"title":"Uniss-MDF: A Multidimensional Face dataset for assessing face analysis on the move","authors":"Pietro Ruiu ,&nbsp;Marinella Iole Cadoni ,&nbsp;Andrea Lagorio ,&nbsp;Seth Nixon ,&nbsp;Filippo Casu ,&nbsp;Massimo Farina ,&nbsp;Mauro Fadda ,&nbsp;Giuseppe A. Trunfio ,&nbsp;Massimo Tistarelli ,&nbsp;Enrico Grosso","doi":"10.1016/j.cviu.2025.104384","DOIUrl":"10.1016/j.cviu.2025.104384","url":null,"abstract":"<div><div>Multidimensional 2D–3D face analysis has demonstrated a strong potential for human identification in several application domains. The combined, synergic use of 2D and 3D data from human faces can counteract typical limitations in 2D face recognition, while improving both accuracy and robustness in identification. On the other hand, current mobile devices, often equipped with depth cameras and high performance computing resources, offer a powerful and practical tool to better investigate new models to jointly process real 2D and 3D face data. However, recent concerns related to privacy of individuals and the collection, storage and processing of personally identifiable biometric information have diminished the availability of public face recognition datasets.</div><div>Uniss-MDF (Uniss-MultiDimensional Face) represents the first collection of combined 2D–3D data of human faces captured with a mobile device. Over 76,000 depth images and videos are captured from over 100 subjects, in both controlled and uncontrolled conditions, over two sessions. The features of Uniss-MDF are extensively compared with existing 2D–3D face datasets. The reported statistics underscore the value of the dataset as a versatile resource for researchers in face recognition on the move and for a wide range of applications. Notably, it is the sole 2D–3D facial dataset using data from a mobile device that includes both 2D and 3D synchronized sequences acquired in controlled and uncontrolled conditions. The Uniss-MDF dataset and the proposed experimental protocols with baseline results provide a new platform to compare processing models for novel research avenues in advanced face analysis on the move.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104384"},"PeriodicalIF":4.3,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Multi-scale feature fusion based SAM for high-quality few-shot medical image segmentation
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-16 | DOI: 10.1016/j.cviu.2025.104389
Shangwang Liu, Ruonan Xu
{"title":"Multi-scale feature fusion based SAM for high-quality few-shot medical image segmentation","authors":"Shangwang Liu,&nbsp;Ruonan Xu","doi":"10.1016/j.cviu.2025.104389","DOIUrl":"10.1016/j.cviu.2025.104389","url":null,"abstract":"<div><div>Applying the Segmentation Everything Model (SAM) to the field of medical image segmentation is a great challenge due to the significant differences between natural and medical images. Direct fine-tuning of SAM using medical images requires a large amount of exhaustively annotated medical image data. This paper aims to propose a new method, High-quality Few-shot Segmentation Everything Model (HF-SAM), to address these issues and achieve efficient medical image segmentation. We proposed HF-SAM, which requires only a small number of medical images for model training and does not need precise medical cues for fine-tuning SAM. HF-SAM employs Low-rank adaptive (LoRA) technology to fine-tune SAM by leveraging the lack of large local details in the image embedding of SAM’s mask decoder and the complementarity between high-level global and low-level local features. Additionally, we propose an Adaptive Weighted Feature Fusion Module (AWFFM) and a two-step skip-feature fusion decoding process. The AWFFM integrates low-level local information into high-level global features without suppressing global information, while the two-step skip-feature fusion decoding process enhances SAM’s ability to capture fine-grained information and local details. Experimental results show that HF-SAM achieves Dice scores of 79.50% on the Synapse dataset and 88.68% on the ACDC dataset. These results outperform existing traditional methods, semi-supervised methods, and other SAM variants in few-shot medical image segmentation. By combining low-rank adaptive technology and the adaptive weighted feature fusion module, HF-SAM effectively addresses the adaptability issues of SAM in medical image segmentation and demonstrates excellent segmentation performance with few samples. This method provides a new solution for the field of medical image segmentation and holds significant application value. The code of HF-SAM is available at <span><span>https://github.com/1683194873xrn/HF-SAM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104389"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
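As a generic reminder of how LoRA-style fine-tuning works (the mechanism the abstract relies on), the snippet below wraps a frozen linear layer with a trainable low-rank update W·x + (alpha/r)·B·A·x. It is an illustrative sketch under assumed dimensions, not the HF-SAM implementation or the SAM codebase.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank residual (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # base projection + low-rank update; only lora_a / lora_b receive gradients
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(2, 196, 256))               # e.g., a ViT token sequence
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                          # torch.Size([2, 196, 256]) 2048
```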
Boosting few-shot point cloud segmentation with intra-class correlation and iterative prototype fusion
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding | Pub Date: 2025-05-16 | DOI: 10.1016/j.cviu.2025.104393
Xindan Zhang, Ying Li, Xinnian Zhang
{"title":"Boosting few-shot point cloud segmentation with intra-class correlation and iterative prototype fusion","authors":"Xindan Zhang ,&nbsp;Ying Li ,&nbsp;Xinnian Zhang","doi":"10.1016/j.cviu.2025.104393","DOIUrl":"10.1016/j.cviu.2025.104393","url":null,"abstract":"<div><div>Semantic segmentation of 3D point clouds is often limited by the challenge of obtaining labeled data. Few-shot point cloud segmentation methods, which can learn previously unseen categories, help reduce reliance on labeled datasets. However, existing methods are susceptible to correlation noise and suffer from significant discrepancies between support prototypes and query features. To address these issues, we first introduce an intra-class correlation enhancement module for filtering correlation noise driven by inter-class similarity and intra-class diversity. Second, to better represent the target classes, we propose an iterative prototype fusion module that adapts the query point cloud feature space, mitigating the problem of object variations in the support set and query set. Extensive experiments on S3DIS and ScanNet benchmark datasets demonstrate that our approach achieves competitive performance with state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104393"},"PeriodicalIF":4.3,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144098449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
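To ground the prototype-based terminology, here is a minimal sketch of the standard few-shot segmentation recipe such methods build on: compute a class prototype by masked average pooling over support point features, then score query points by cosine similarity to the prototype. Shapes and names are hypothetical; the paper's correlation-enhancement and iterative-fusion modules are not reproduced.

```python
import torch
import torch.nn.functional as F

def masked_average_prototype(support_feats, support_mask):
    """support_feats: (N, D) point features; support_mask: (N,) binary mask for one class."""
    mask = support_mask.float().unsqueeze(1)                        # (N, 1)
    return (support_feats * mask).sum(0) / mask.sum().clamp_min(1)  # (D,) class prototype

support_feats = torch.randn(2048, 64)            # features of support-set points
support_mask = (torch.rand(2048) > 0.7).long()   # foreground mask for the target class
query_feats = torch.randn(4096, 64)              # features of query-set points

proto = masked_average_prototype(support_feats, support_mask)
scores = F.cosine_similarity(query_feats, proto.unsqueeze(0), dim=1)  # (4096,) per-point scores
pred_fg = scores > 0.0                            # toy thresholded foreground prediction
print(proto.shape, scores.shape, pred_fg.sum().item())
```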