Latest Publications in Computer Vision and Image Understanding

Joint image-instance spatial–temporal attention for few-shot action recognition
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104322
Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying
{"title":"Joint image-instance spatial–temporal attention for few-shot action recognition","authors":"Zefeng Qian ,&nbsp;Chongyang Zhang ,&nbsp;Yifei Huang ,&nbsp;Gang Wang ,&nbsp;Jiangyong Ying","doi":"10.1016/j.cviu.2025.104322","DOIUrl":"10.1016/j.cviu.2025.104322","url":null,"abstract":"<div><div>Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial–temporal attention approach (I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>ST) for Few-shot Action Recognition. The core concept of I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>ST is to perceive the action-related instances and integrate them with image features via spatial–temporal attention. Specifically, I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial–temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial–temporal Attention is used to construct the feature dependency between instances and images. To enhance the prototype representations of different categories of videos, a pair of spatial–temporal attention sub-modules is introduced to combine image features and instance embeddings across both temporal and spatial dimensions, and a global fusion sub-module is utilized to aggregate global contextual information, then robust action video prototypes can be formed. Finally, based on the video prototype, a Global–Local Prototype Matching is performed for reliable few-shot video matching. In this manner, our proposed I<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>ST can effectively exploit the foreground instance-level cues and model more accurate spatial–temporal relationships for the complex few-shot video recognition scenarios. Extensive experiments across standard few-shot benchmarks demonstrate that the proposed framework outperforms existing methods and achieves state-of-the-art performance under various few-shot settings.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104322"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
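To make the fusion idea in the abstract above more concrete, the following is a minimal, hypothetical PyTorch sketch of joint image-instance spatial–temporal attention: per-frame image features attend to instance embeddings (spatial), frame tokens attend to each other (temporal), and a small MLP fuses both into a video prototype. All shapes, layer sizes, and the fusion order are assumptions for illustration, not the authors' exact I²ST design.

```python
# Illustrative sketch only; shapes and layer choices are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class JointImageInstanceAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_feats, inst_feats):
        # img_feats:  (B, T, D)    one feature per frame
        # inst_feats: (B, T, N, D) N action-related instance embeddings per frame
        B, T, N, D = inst_feats.shape
        # Spatial attention: each frame feature queries its own instance embeddings.
        q = img_feats.reshape(B * T, 1, D)
        kv = inst_feats.reshape(B * T, N, D)
        spatial, _ = self.spatial_attn(q, kv, kv)                          # (B*T, 1, D)
        spatial = spatial.reshape(B, T, D)
        # Temporal attention: frame-level tokens attend to each other across time.
        temporal, _ = self.temporal_attn(img_feats, img_feats, img_feats)  # (B, T, D)
        # Global fusion: combine both streams, then average into a video-level prototype.
        fused = self.global_fuse(torch.cat([spatial, temporal], dim=-1))   # (B, T, D)
        return fused.mean(dim=1)                                           # (B, D)

# Random tensors stand in for backbone features and segmentation-guided instance embeddings.
proto = JointImageInstanceAttention()(torch.randn(2, 8, 256), torch.randn(2, 8, 5, 256))
print(proto.shape)  # torch.Size([2, 256])
```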
Establishing a unified evaluation framework for human motion generation: A comparative analysis of metrics
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104337
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
{"title":"Establishing a unified evaluation framework for human motion generation: A comparative analysis of metrics","authors":"Ali Ismail-Fawaz ,&nbsp;Maxime Devanne ,&nbsp;Stefano Berretti ,&nbsp;Jonathan Weber ,&nbsp;Germain Forestier","doi":"10.1016/j.cviu.2025.104337","DOIUrl":"10.1016/j.cviu.2025.104337","url":null,"abstract":"<div><div>The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using two publicly available datasets, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code: <span><span>https://github.com/MSD-IRIMAS/Evaluating-HMG</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104337"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143561782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
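The "warping diversity" notion above can be illustrated with a simple statistic: the mean pairwise dynamic-time-warping (DTW) distance among generated motions, which grows when the sequences differ in their temporal structure. This is a rough sketch under that assumption; the exact metric proposed in the paper may be defined differently.

```python
# Illustrative sketch only: mean pairwise DTW as a proxy for warping diversity.
import numpy as np

def dtw_distance(a, b):
    """DTW between two sequences of pose vectors, shapes (Ta, D) and (Tb, D)."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb]

def warping_diversity(motions):
    """Mean pairwise DTW distance over a set of generated motions."""
    dists = [dtw_distance(motions[i], motions[j])
             for i in range(len(motions)) for j in range(i + 1, len(motions))]
    return float(np.mean(dists))

# Example: four random "motions" of varying length, 10-D pose per frame.
rng = np.random.default_rng(0)
motions = [rng.normal(size=(rng.integers(20, 40), 10)) for _ in range(4)]
print(warping_diversity(motions))
```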
Mandala simplification: Sacred symmetry meets minimalism
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104319
Tusita Sarkar, Preetam Chayan Chatterjee, Partha Bhowmick
{"title":"Mandala simplification: Sacred symmetry meets minimalism","authors":"Tusita Sarkar,&nbsp;Preetam Chayan Chatterjee,&nbsp;Partha Bhowmick","doi":"10.1016/j.cviu.2025.104319","DOIUrl":"10.1016/j.cviu.2025.104319","url":null,"abstract":"<div><div>Mandalas, intricate artistic designs with radial symmetry, are imbued with a timeless allure that transcends cultural boundaries. Found in various cultures and spiritual traditions worldwide, mandalas hold profound significance as symbols of unity, wholeness, and spiritual transformation. At the heart of mandalas lies the concept of sacred symmetry, a timeless principle that resonates with the deepest realms of human consciousness. However, in handcrafted mandalas, symmetry often falls short of perfection, necessitating refinement to evoke harmony and balance. With this in mind, we introduce a computational approach aimed at capturing the all-round symmetry of mandalas through minimalist principles. By leveraging innovative geometric and graph-theoretic tools and an interactive twin atlas, this approach streamlines parameter domains to achieve the revered state of sacred symmetry, epitomizing harmonious balance. This is especially beneficial when dealing with handcrafted mandalas of subpar quality, necessitating concise representations for tasks like mandala editing, recreation, atlas building, and referencing. Experimental findings and related results demonstrate the effectiveness of the proposed methodology.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104319"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
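As a generic illustration of what "radial symmetry" means computationally (not the paper's geometric/graph-theoretic machinery), one can score how well an image matches its own rotated copies. The fold count, interpolation, and correlation measure below are assumptions for demonstration only.

```python
# Illustrative sketch only: a simple n-fold radial symmetry score via rotated self-correlation.
import numpy as np
from scipy.ndimage import rotate

def radial_symmetry_score(img, folds=8):
    """img: 2D grayscale array; mean correlation with its k*(360/folds)-degree rotations."""
    img = (img - img.mean()) / (img.std() + 1e-8)
    scores = []
    for k in range(1, folds):
        rot = rotate(img, angle=360.0 * k / folds, reshape=False, order=1)
        scores.append(float(np.mean(img * rot)))  # correlation at zero shift
    return float(np.mean(scores))

# A synthetic 4-fold symmetric pattern scores much higher than random noise.
y, x = np.mgrid[-64:64, -64:64]
pattern = np.cos(4 * np.arctan2(y, x)) * np.exp(-(x**2 + y**2) / 2000.0)
print(radial_symmetry_score(pattern, folds=4), radial_symmetry_score(np.random.rand(128, 128), folds=4))
```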
Navigating social contexts: A transformer approach to relationship recognition
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104327
Lorenzo Berlincioni, Luca Cultrera, Marco Bertini, Alberto Del Bimbo
{"title":"Navigating social contexts: A transformer approach to relationship recognition","authors":"Lorenzo Berlincioni,&nbsp;Luca Cultrera,&nbsp;Marco Bertini,&nbsp;Alberto Del Bimbo","doi":"10.1016/j.cviu.2025.104327","DOIUrl":"10.1016/j.cviu.2025.104327","url":null,"abstract":"<div><div>Recognizing interpersonal relationships is essential for enabling human–computer systems to understand and engage effectively with social contexts. Compared to other computer vision tasks, Interpersonal relation recognition requires an higher semantic understanding of the scene, ranging from large background context to finer clues. We propose a transformer based model that attends to each person pair relation in an image reaching state of the art performances on a classical benchmark dataset People in Social Context (PISC). Our solution differs from others as it makes no use of a separate GNN but relies instead on transformers alone. Additionally, we explore the impact of incorporating additional supervision from occupation labels on relationship recognition performance and we extensively ablate different architectural parameters and loss choices. Furthermore, we compare our model with a recent Large Multimodal Model (LMM) to precisely assess the zero-shot capabilities of such general models over highly specific tasks. Our study contributes to advancing the state of the art in social relationship recognition and highlights the potential of transformer-based models in capturing complex social dynamics from visual data.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104327"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143519148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
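The GNN-free, pair-attending design described above can be sketched as follows: concatenate the features of every person pair into a token, prepend a scene-context token, and run a plain transformer encoder before a relation classifier. Token construction, dimensions, and the head are assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only: pair tokens + transformer encoder for relation classification.
import torch
import torch.nn as nn

class PairRelationTransformer(nn.Module):
    def __init__(self, feat_dim=512, dim=256, num_relations=6, layers=2):
        super().__init__()
        self.pair_proj = nn.Linear(2 * feat_dim, dim)   # concat of the two person features
        self.ctx_proj = nn.Linear(feat_dim, dim)        # global scene/context feature
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(dim, num_relations)

    def forward(self, person_feats, ctx_feat):
        # person_feats: (B, P, feat_dim) per-person features; ctx_feat: (B, feat_dim)
        B, P, _ = person_feats.shape
        i, j = torch.triu_indices(P, P, offset=1)        # all unordered person pairs
        pairs = torch.cat([person_feats[:, i], person_feats[:, j]], dim=-1)
        tokens = torch.cat([self.ctx_proj(ctx_feat)[:, None], self.pair_proj(pairs)], dim=1)
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 1:])                  # (B, n_pairs, num_relations) logits

logits = PairRelationTransformer()(torch.randn(2, 4, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 6, 6]) -- 6 pairs from 4 people, 6 relation classes
```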
Brain tumor image segmentation based on shuffle transformer-dynamic convolution and inception dilated convolution
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104324
Lifang Zhou, Ya Wang
{"title":"Brain tumor image segmentation based on shuffle transformer-dynamic convolution and inception dilated convolution","authors":"Lifang Zhou ,&nbsp;Ya Wang","doi":"10.1016/j.cviu.2025.104324","DOIUrl":"10.1016/j.cviu.2025.104324","url":null,"abstract":"<div><div>Accurate segmentation of brain tumors is essential for accurate clinical diagnosis and effective treatment. Convolutional neural networks (CNNs) have improved brain tumor segmentation with their excellent performance in local feature modeling. However, they still face the challenge of unpredictable changes in tumor size and location, because it cannot be effectively matched by CNN-based methods with local and regular receptive fields. To overcome these obstacles, we propose brain tumor image segmentation based on shuffle transformer-dynamic convolution and inception dilated convolution that captures and adapts different features of tumors through multi-scale feature extraction. Our model combines Shuffle Transformer-Dynamic Convolution (STDC) to capture both fine-grained and contextual image details so that it helps improve localization accuracy. In addition, the Inception Dilated Convolution(IDConv) module solves the problem of significant changes in the size of brain tumors, and then captures the information of different size of object. The multi-scale feature aggregation(MSFA) module integrates features from different encoder levels, which contributes to enriching the scale diversity of input patches and enhancing the robustness of segmentation. The experimental results conducted on the BraTS 2019, BraTS 2020, BraTS 2021, and MSD BTS datasets indicate that our model outperforms other state-of-the-art methods in terms of accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104324"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143534227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
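The multi-scale idea behind an inception-style dilated convolution module can be sketched as parallel 3x3 branches with increasing dilation rates whose outputs are concatenated and fused. Branch count, dilation rates, and channel splits below are assumptions, not the paper's exact IDConv configuration.

```python
# Illustrative sketch only: Inception-style block with parallel dilated 3x3 branches.
import torch
import torch.nn as nn

class InceptionDilatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # 1x1 conv fuses the concatenated multi-scale branches back to out_ch channels.
        self.fuse = nn.Conv2d(branch_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Example: spatial size is preserved while the receptive field spans multiple scales.
y = InceptionDilatedConv(64, 128)(torch.randn(1, 64, 96, 96))
print(y.shape)  # torch.Size([1, 128, 96, 96])
```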
Efficient feature selection for pre-trained vision transformers
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104326
Lan Huang, Jia Zeng, Mengqiang Yu, Weiping Ding, Xingyu Bai, Kangping Wang
{"title":"Efficient feature selection for pre-trained vision transformers","authors":"Lan Huang ,&nbsp;Jia Zeng ,&nbsp;Mengqiang Yu ,&nbsp;Weiping Ding ,&nbsp;Xingyu Bai ,&nbsp;Kangping Wang","doi":"10.1016/j.cviu.2025.104326","DOIUrl":"10.1016/j.cviu.2025.104326","url":null,"abstract":"<div><div>Handcrafted layer-wise vision transformers have demonstrated remarkable performance in image classification. However, their high computational cost limits their practical applications. In this paper, we first identify and highlight the data-independent feature redundancy in pre-trained Vision Transformer (ViT) models. Based on this observation, we explore the feasibility of searching for the best substructure within the original pre-trained model. To this end, we propose EffiSelecViT, a novel pruning method aimed at reducing the computational cost of ViTs while preserving their accuracy. EffiSelecViT introduces importance scores for both self-attention heads and Multi-Layer Perceptron (MLP) neurons in pre-trained ViT models. L1 regularization is applied to constrain and learn these scores. In this simple way, components that are crucial for model performance are assigned higher scores, while those with lower scores are identified as less important and subsequently pruned. Experimental results demonstrate that EffiSelecViT can prune DeiT-B to retain only 64% of FLOPs while maintaining accuracy. This efficiency-accuracy trade-off is consistent across various ViT architectures. Furthermore, qualitative analysis reveals enhanced information expression in the pruned models, affirming the effectiveness and practicality of EffiSelecViT. The code is available at <span><span>https://github.com/ZJ6789/EffiSelecViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104326"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143549732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
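The core mechanism described above, learnable importance scores on attention heads and MLP neurons trained with an L1 penalty, can be sketched on a single ViT block as follows. The gate placement, hyperparameters, and training loop are assumptions for illustration, not the released EffiSelecViT code.

```python
# Illustrative sketch only: L1-regularized importance scores (gates) on heads and MLP neurons.
import torch
import torch.nn as nn

class GatedViTBlock(nn.Module):
    def __init__(self, dim=384, heads=6, mlp_hidden=1536):
        super().__init__()
        self.h, self.dh = heads, dim // heads
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.fc1, self.fc2 = nn.Linear(dim, mlp_hidden), nn.Linear(mlp_hidden, dim)
        self.head_score = nn.Parameter(torch.ones(heads))        # one score per attention head
        self.neuron_score = nn.Parameter(torch.ones(mlp_hidden)) # one score per MLP hidden unit

    def forward(self, x):
        B, N, D = x.shape
        q, k, v = self.qkv(self.norm1(x)).reshape(B, N, 3, self.h, self.dh).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1) / self.dh ** 0.5).softmax(dim=-1)
        out = attn @ v                                      # (B, heads, N, dh) per-head outputs
        out = out * self.head_score[None, :, None, None]    # gate each head before mixing
        x = x + self.proj(out.transpose(1, 2).reshape(B, N, D))
        m = torch.relu(self.fc1(self.norm2(x))) * self.neuron_score  # gate MLP hidden units
        return x + self.fc2(m)

    def l1_penalty(self):
        return self.head_score.abs().sum() + self.neuron_score.abs().sum()

block = GatedViTBlock()
out = block(torch.randn(2, 197, 384))
loss = out.pow(2).mean() + 1e-4 * block.l1_penalty()  # placeholder task loss + sparsity penalty
loss.backward()
```

After training, heads and neurons whose scores fall below a threshold would be removed, which is where the FLOPs reduction comes from.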
Lifelong visible–infrared person re-identification via replay samples domain-modality-mix reconstruction and cross-domain cognitive network
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104328
Xianyu Zhu, Guoqiang Xiao, Michael S. Lew, Song Wu
{"title":"Lifelong visible–infrared person re-identification via replay samples domain-modality-mix reconstruction and cross-domain cognitive network","authors":"Xianyu Zhu ,&nbsp;Guoqiang Xiao ,&nbsp;Michael S. Lew ,&nbsp;Song Wu","doi":"10.1016/j.cviu.2025.104328","DOIUrl":"10.1016/j.cviu.2025.104328","url":null,"abstract":"<div><div>Adapting statically-trained models to the incessant influx of data streams poses a pivotal research challenge. Concurrently, visible and infrared person re-identification (VI-ReID) offers an all-day surveillance mode to advance intelligent surveillance and elevate public safety precautions. Hence, we are pioneering a more fine-grained exploration of the lifelong VI-ReID task at the camera level, aiming to imbue the learned models with the capabilities of lifelong learning and memory within the continuous data streams. This task confronts dual challenges of cross-modality and cross-domain variations. Thus, in this paper, we proposed a Domain-Modality-Mix (DMM) based replay samples reconstruction strategy and Cross-domain Cognitive Network (CDCN) to address those challenges. Firstly, we establish an effective and expandable baseline model based on residual neural networks. Secondly, capitalizing on the unexploited potential knowledge of a memory bank that archives diverse replay samples, we enhance the anti-forgetting ability of our model by the Domain-Modality-Mix strategy, which devising a cross-domain, cross-modal image-level replay sample reconstruction, effectively alleviating catastrophic forgetting induced by modality and domain variations. Finally, guided by the Chunking Theory in cognitive psychology, we designed a Cross-domain Cognitive Network, which incorporates a camera-aware, expandable graph convolutional cognitive network to facilitate adaptive learning of intra-modal consistencies and cross-modal similarities within continuous cross-domain data streams. Extensive experiments demonstrate that our proposed method has remarkable adaptability and robust resistance to forgetting and outperforms multiple state-of-the-art methods in comparative assessments of the performance of LVI-ReID. The source code of our designed method is at <span><span>https://github.com/SWU-CS-MediaLab/DMM-CDCN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104328"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143561781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
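One simple way to picture an image-level mix of replay samples across domains and modalities is a patch-wise combination of a visible and an infrared image, so that a single replay sample carries cues from both. The grid layout and swap probability below are assumptions; the paper's DMM reconstruction may be defined differently.

```python
# Illustrative sketch only: patch-wise mixing of replay samples from different domains/modalities.
import torch

def domain_modality_mix(vis_img, ir_img, grid=4, p_swap=0.5, generator=None):
    """vis_img, ir_img: (C, H, W) tensors from different domains/modalities.
    Returns an image whose grid cells are randomly taken from either source."""
    C, H, W = vis_img.shape
    gh, gw = H // grid, W // grid
    mixed = vis_img.clone()
    for i in range(grid):
        for j in range(grid):
            if torch.rand(1, generator=generator).item() < p_swap:
                mixed[:, i * gh:(i + 1) * gh, j * gw:(j + 1) * gw] = \
                    ir_img[:, i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
    return mixed

# Random stand-ins for a visible and an infrared replay sample of the same identity.
g = torch.Generator().manual_seed(0)
mixed = domain_modality_mix(torch.rand(3, 256, 128), torch.rand(3, 256, 128), generator=g)
print(mixed.shape)  # torch.Size([3, 256, 128])
```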
Spatial and temporal beliefs for mistake detection in assembly tasks
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104338
Guodong Ding, Fadime Sener, Shugao Ma, Angela Yao
{"title":"Spatial and temporal beliefs for mistake detection in assembly tasks","authors":"Guodong Ding ,&nbsp;Fadime Sener ,&nbsp;Shugao Ma ,&nbsp;Angela Yao","doi":"10.1016/j.cviu.2025.104338","DOIUrl":"10.1016/j.cviu.2025.104338","url":null,"abstract":"<div><div>Assembly tasks, as an integral part of daily routines and activities, involve a series of sequential steps that are prone to error. This paper proposes a novel method for identifying ordering mistakes in assembly tasks based on knowledge-grounded beliefs. The beliefs comprise spatial and temporal aspects, each serving a unique role. Spatial beliefs capture the structural relationships among assembly components and indicate their topological feasibility. Temporal beliefs model the action preconditions and enforce sequencing constraints. Furthermore, we introduce a learning algorithm that dynamically updates and augments the belief sets online. To evaluate, we first test our approach in deducing predefined rules on synthetic data based on industry assembly. We also verify our approach on the real-world Assembly101 dataset, enhanced with annotations of component information. Our framework achieves superior performance in detecting ordering mistakes under both synthetic and real-world settings, highlighting the effectiveness of our approach.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104338"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143579726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
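The role of temporal beliefs as precondition checks can be illustrated with a tiny rule-based example: a step is flagged as an ordering mistake when the steps it depends on have not yet been executed. The step names and rules are hypothetical; the paper learns and updates such beliefs online rather than hard-coding them.

```python
# Illustrative sketch only: precondition-style temporal beliefs for ordering-mistake detection.
from typing import Dict, List, Set

def detect_ordering_mistakes(sequence: List[str], preconditions: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in `sequence` whose preconditions were not yet satisfied."""
    done: Set[str] = set()
    mistakes = []
    for step in sequence:
        if not preconditions.get(step, set()) <= done:   # some required step is still missing
            mistakes.append(step)
        done.add(step)
    return mistakes

# Hypothetical toy assembly: the cabin must sit on the chassis before the roof goes on.
rules = {"attach_roof": {"attach_cabin"}, "attach_cabin": {"place_chassis"}}
print(detect_ordering_mistakes(["place_chassis", "attach_roof", "attach_cabin"], rules))
# ['attach_roof'] -- the roof was attached before the cabin was in place
```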
View-to-label: Multi-view consistency for self-supervised monocular 3D object detection
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104320
Issa Mouawad, Nikolas Brasch, Fabian Manhardt, Federico Tombari, Francesca Odone
{"title":"View-to-label: Multi-view consistency for self-supervised monocular 3D object detection","authors":"Issa Mouawad ,&nbsp;Nikolas Brasch ,&nbsp;Fabian Manhardt ,&nbsp;Federico Tombari ,&nbsp;Francesca Odone","doi":"10.1016/j.cviu.2025.104320","DOIUrl":"10.1016/j.cviu.2025.104320","url":null,"abstract":"<div><div>For autonomous vehicles, driving safely is highly dependent on the capability to correctly perceive the environment in the 3D space, hence the task of 3D object detection represents a fundamental aspect of perception. While 3D sensors deliver accurate metric perception, monocular approaches enjoy cost and availability advantages that are valuable in a wide range of applications. Unfortunately, training monocular methods requires a vast amount of annotated data. To compensate for this need, we propose a novel approach to self-supervise 3D object detection purely from RGB video sequences, leveraging geometric constraints and weak labels. Unlike other approaches that exploit additional sensors during training, <em>our method relies on the temporal continuity of video sequences.</em> A supervised pre-training on synthetic data produces initial plausible 3D boxes, then our geometric and photometrically grounded losses provide a strong self-supervision signal that allows the model to be fine-tuned on real data without labels.</div><div>Our experiments on Autonomous Driving benchmark datasets showcase the effectiveness and generality of our approach and the competitive performance compared to other self-supervised approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104320"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143519149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
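A minimal way to picture the geometric side of such multi-view self-supervision is a consistency term between 3D box centers predicted in consecutive frames, brought into a common coordinate frame via the relative camera pose. This is one plausible signal under the temporal-continuity assumption, not the paper's full loss formulation.

```python
# Illustrative sketch only: cross-frame geometric consistency of predicted 3D box centers.
import torch

def multiview_center_consistency(centers_t, centers_t1, R, t):
    """centers_t, centers_t1: (N, 3) centers of the same tracked objects in frames t and t+1
    (camera coordinates); R (3, 3), t (3,): pose mapping frame-t coordinates into frame t+1."""
    warped = centers_t @ R.T + t              # move frame-t predictions into frame t+1
    return torch.mean(torch.norm(warped - centers_t1, dim=-1))

# Example: a 1.5 m forward ego-motion fully explained by the pose gives a near-zero loss.
R = torch.eye(3)
t = torch.tensor([0.0, 0.0, -1.5])            # static scene shifts toward the camera
centers_t = torch.tensor([[2.0, 0.5, 10.0], [-1.0, 0.2, 15.0]])
centers_t1 = centers_t + t                    # static objects observed after the ego-motion
print(multiview_center_consistency(centers_t, centers_t1, R, t))  # tensor(0.)
```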
Incremental few-shot instance segmentation without fine-tuning on novel classes
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2025-03-01, DOI: 10.1016/j.cviu.2025.104323
Luofeng Zhang, Libo Weng, Yuanming Zhang, Fei Gao
{"title":"Incremental few-shot instance segmentation without fine-tuning on novel classes","authors":"Luofeng Zhang,&nbsp;Libo Weng,&nbsp;Yuanming Zhang,&nbsp;Fei Gao","doi":"10.1016/j.cviu.2025.104323","DOIUrl":"10.1016/j.cviu.2025.104323","url":null,"abstract":"<div><div>Many current incremental few-shot object detection and instance segmentation methods necessitate fine-tuning on novel classes, which presents difficulties when training newly emerged classes on devices with limited computational power. In this paper, a finetune-free incremental few-shot instance segmentation method is proposed. Firstly, a novel weight generator (NWG) is proposed to map the embeddings of novel classes to their respective true centers. Then, the limitations of cosine similarity on novel classes with few samples are analyzed, and a simple yet effective improvement called the piecewise function for similarity calculation (PFSC) is proposed. Lastly, a probability dependency method (PD) is designed to mitigate the impact on the performance of base classes after registering novel classes. The comparative experimental results show that the proposed model outperforms existing finetune-free methods much more on MS COCO and VOC datasets, and registration of novel classes has almost no negative impact on the base classes. Therefore, the model exhibits excellent performance and the proposed finetune-free idea enables it to learn novel classes directly through inference on devices with limited computational power.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"254 ","pages":"Article 104323"},"PeriodicalIF":4.3,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143519147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
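To give a feel for what a piecewise modification of cosine similarity might look like, the sketch below keeps high similarities unchanged but damps low ones so that noisy novel-class prototypes trigger fewer false matches. The threshold, damping factor, and functional form are assumptions for illustration only, not the paper's PFSC definition.

```python
# Illustrative sketch only: a piecewise-damped cosine similarity for prototype matching.
import torch
import torch.nn.functional as F

def piecewise_cosine(query, prototypes, threshold=0.5, low_scale=0.2):
    """query: (B, D) embeddings; prototypes: (K, D) class prototypes.
    Returns (B, K) scores: cosine similarity, damped below the threshold."""
    sim = F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return torch.where(sim >= threshold, sim, low_scale * sim)

scores = piecewise_cosine(torch.randn(4, 128), torch.randn(10, 128))
print(scores.shape)  # torch.Size([4, 10])
```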