Computer Vision and Image Understanding: Latest Publications

Collaborative Neural Painting
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2025.104298
Nicola Dall’Asen, Willi Menapace, Elia Peruzzo, Enver Sangineto, Yiming Wang, Elisa Ricci
The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative art painting generation between users and agents. Given any number of user-input brushstrokes as the context, or just the desired object class, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, allowing users’ modifications at any phase until completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes both editing and composition operations easy. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism to model the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP, and adopt diffusion models as the basic learning process for their effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research. Project page and code: this https URL.
Citations: 0
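The abstract couples a parametrized-stroke representation with a masking scheme trained under a diffusion objective. As a rough illustration only — the function name `mask_strokes`, the (N, D) stroke layout, and the noising step are assumptions, not the paper's actual scheme — splitting a stroke sequence into user context and strokes to complete could look like this:

```python
import torch

def mask_strokes(strokes, ctx_ratio=None):
    """Split an (N, D) tensor of N parametrized strokes (position, size,
    color, ...) into user-provided context and strokes to complete, then
    noise the latter as a diffusion model sees them at its noisiest step."""
    n, d = strokes.shape
    if ctx_ratio is None:
        ctx_ratio = torch.rand(1).item()   # vary context size, mimicking interactive use
    keep = torch.randperm(n)[: int(n * ctx_ratio)]
    gen_mask = torch.ones(n, dtype=torch.bool)
    gen_mask[keep] = False                 # False = user context, True = to be completed
    noisy = strokes.clone()
    noisy[gen_mask] = torch.randn(int(gen_mask.sum()), d)
    return noisy, gen_mask
```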
Comparing Human Pose Estimation through deep learning approaches: An overview
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2025.104297
Gaetano Dibenedetto, Stefanos Sotiropoulos, Marco Polignano, Giuseppe Cavallo, Pasquale Lops
In the everyday IoT ecosystem, many devices and systems are interconnected in an intelligent living environment to create a comfortable and efficient living space. In this scenario, approaches based on automatic recognition of actions and events can support fully autonomous digital assistants and personalized services. A pivotal component in this domain is Human Pose Estimation, which plays a critical role in action recognition for a wide range of applications, including home automation, healthcare, safety, and security. These systems are designed to detect human actions and deliver customized real-time responses and support. Selecting an appropriate technique for Human Pose Estimation is crucial to enhancing these systems for various applications. This choice hinges on the specific environment and can be categorized on the basis of whether the technique is designed for images or videos, single-person or multi-person scenarios, and monocular or multiview inputs. A comprehensive overview of recent research outcomes is essential to showcase the evolution of the research area, along with its underlying principles and varied application domains. Key benchmarks across these techniques provide valuable insights into their performance; hence, the paper summarizes these benchmarks and offers a comparative analysis of the techniques. As research in this field continues to evolve, it is critical for researchers to stay up to date with the latest developments and methodologies to promote further innovation in pose estimation research. This overview therefore presents a thorough examination of the subject, equipping researchers with the knowledge and resources necessary to investigate the topic and retrieve all information relevant to their investigations.
Citations: 0
Cleanness-navigated-contamination network: A unified framework for recovering regional degradation
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104274
Qianhao Yu, Naishan Zheng, Jie Huang, Feng Zhao
Image restoration from regional degradation has long been an important and challenging task. The key to contamination removal is recovering the contents of the corrupted regions under the guidance of the non-corrupted regions. Due to inadequate long-range modeling, CNN-based approaches cannot thoroughly exploit the information from non-corrupted regions, resulting in distorted visuals with artificial traces between different regions. To address this issue, we propose a novel Cleanness-Navigated-Contamination Network (CNCNet), a unified framework for recovering regional image contamination such as shadow, flare, and other regional degradation. Our method mainly consists of two components: a contamination-oriented adaptive normalization (COAN) module and a contamination-aware aggregation with transformer (CAAT) module, both based on the contamination region mask. Under the guidance of the contamination mask, the COAN module computes statistics from the non-corrupted region and adaptively applies them to the corrupted region for region-wise restoration. The CAAT module utilizes the region mask to precisely guide the restoration of each contaminated pixel by attending to the highly relevant pixels from the contamination-free regions for global pixel-wise restoration. Extensive experiments on both shadow removal and flare removal tasks show that our network framework achieves superior restoration performance.
Citations: 0
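The COAN module, as described, transfers clean-region statistics onto the corrupted region under mask guidance. A minimal sketch of that idea follows; the function name, the per-channel statistics, and the blending are assumptions, not the paper's exact formulation:

```python
import torch

def coan(feat, mask, eps=1e-5):
    """Mask-guided adaptive normalization in the spirit of COAN.
    feat: (B, C, H, W) features; mask: (B, 1, H, W) with 1 marking the
    contaminated region. Clean-region statistics are transferred onto
    the contaminated region; clean pixels pass through unchanged."""
    clean, dirty = 1.0 - mask, mask

    def region_stats(m):
        area = m.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mu = (feat * m).sum(dim=(2, 3), keepdim=True) / area
        var = (((feat - mu) ** 2) * m).sum(dim=(2, 3), keepdim=True) / area
        return mu, var

    mu_c, var_c = region_stats(clean)   # per-channel clean statistics
    mu_d, var_d = region_stats(dirty)   # per-channel corrupted statistics
    normed = (feat - mu_d) / (var_d + eps).sqrt() * (var_c + eps).sqrt() + mu_c
    return feat * clean + normed * dirty   # only corrupted pixels are rewritten
```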
Full-body virtual try-on using top and bottom garments with wearing style control
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104259
Soonchan Park, Jinah Park
Various studies have been proposed to synthesize realistic images for image-based virtual try-on, but most of them are limited to replacing a single item on a given model, without considering wearing styles. In this paper, we address the novel problem of full-body virtual try-on with multiple garments by introducing a new benchmark dataset and an image synthesis method. Our Fashion-TB dataset provides comprehensive clothing information by mapping fashion models to their corresponding top and bottom garments, along with semantic region annotations that represent the structure of the garments. WGF-VITON, the single-stage network we have developed, generates full-body try-on images using top and bottom garments simultaneously. Instead of relying on preceding networks to estimate intermediate knowledge, the modules for garment transformation and image synthesis are integrated and trained end-to-end. Furthermore, our method proposes a Wearing-guide scheme to control the wearing styles in the synthesized try-on images. Through various experiments on the full-body virtual try-on task, WGF-VITON outperforms state-of-the-art networks in both quantitative and qualitative evaluations with an optimized number of parameters, while allowing users to control the wearing styles of the output images. The code and data are available at https://github.com/soonchanpark/WGF-VITON.
Citations: 0
SSL-Rehab: Assessment of physical rehabilitation exercises through self-supervised learning of 3D skeleton representations
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104275
Ikram Kourbane, Panagiotis Papadakis, Mihai Andries
Rehabilitation aims to assist individuals in recovering or enhancing functions that have been lost or impaired due to injury, illness, or disease. The automatic assessment of physical rehabilitation exercises offers a valuable method for patient supervision, complementing or potentially substituting traditional clinical evaluations. However, acquiring large-scale annotated datasets presents challenges, prompting the need for self-supervised learning and transfer learning in the rehabilitation domain. Our proposed approach integrates these two strategies through Low-Rank Adaptation (LoRA) for both pretraining and fine-tuning. Specifically, we train a foundation model to learn robust 3D skeleton features that adapt to varying levels of masked motion complexity through a three-stage process. In the first stage, we apply a high masking ratio to a subset of joints, using a transformer-based architecture with a graph embedding layer to capture fundamental motion features. In the second stage, we reduce the masking ratio and expand the model’s capacity to learn more intricate motion patterns and interactions between joints. Finally, in the third stage, we further lower the masking ratio to enable the model to refine its understanding of detailed motion dynamics, optimizing its overall performance. During the second and third stages, LoRA layers are incorporated to extract unique features tailored to each masking level, ensuring efficient adaptation without significantly increasing the model size. Fine-tuning for downstream tasks shows that the model performs better when different masked motion levels are utilized. Through extensive experiments conducted on the publicly available KIMORE and UI-PRMD datasets, we demonstrate the effectiveness of our approach in accurately evaluating the execution quality of rehabilitation exercises, surpassing state-of-the-art performance across all metrics. Our project page is available online.
Citations: 0
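The three-stage schedule lowers the joint-masking ratio as pretraining progresses. A minimal sketch of the masking side, assuming a (T, J, 3) skeleton layout and illustrative ratios — the paper's actual ratios and the LoRA wiring are not given here:

```python
import torch

STAGE_RATIOS = [0.8, 0.6, 0.4]  # illustrative: masking eases stage by stage

def mask_joints(skeleton, ratio):
    """skeleton: (T, J, 3) tensor of T frames with J 3D joints.
    Masks a random subset of joints across the whole sequence and
    returns the masked input plus the reconstruction target."""
    T, J, _ = skeleton.shape
    masked_ids = torch.randperm(J)[: int(J * ratio)]
    masked = skeleton.clone()
    masked[:, masked_ids] = 0.0            # hide the selected joints
    return masked, masked_ids, skeleton[:, masked_ids]

for ratio in STAGE_RATIOS:                 # one pretraining stage per ratio
    # In stages 2 and 3 the paper adds LoRA layers; only the masking is shown.
    masked, ids, target = mask_joints(torch.randn(120, 25, 3), ratio)
```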
Nonlocal Gaussian scale mixture modeling for hyperspectral image denoising
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104270
Ling Ding, Qiong Wang, Yin Poo, Xinggan Zhang
Recent nonlocal sparsity methods have gained significant attention in hyperspectral image (HSI) denoising. These methods leverage the nonlocal self-similarity (NSS) prior to group similar full-band patches into nonlocal full-band groups, and then enforce a sparsity constraint, usually through soft-thresholding or hard-thresholding operators, on each nonlocal full-band group. However, given that real HSI data are non-stationary and affected by noise, the variances of the sparse coefficients are unknown and challenging to estimate accurately from the degraded HSI, leading to suboptimal denoising performance. In this paper, we propose a novel nonlocal Gaussian scale mixture (NGSM) approach for HSI denoising, which significantly enhances the estimation accuracy of both the variances of the sparse coefficients and the unknown sparse coefficients themselves. To reduce spectral redundancy, a global spectral low-rank (LR) prior is integrated with the NGSM model and consolidated into a variational framework for optimization. Extensive experimental results demonstrate that the proposed NGSM algorithm achieves convincing improvements over many state-of-the-art HSI denoising methods, in both quantitative and visual evaluations, while offering exceptional computational efficiency.
Citations: 0
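For readers unfamiliar with the prior named in the title: a Gaussian scale mixture models each sparse coefficient as a Gaussian scaled by a hidden positive multiplier. This is the standard construction (the notation here is generic, not the paper's):

```latex
x = \sqrt{\theta}\,\alpha, \qquad \alpha \sim \mathcal{N}(0, \sigma^2),\quad \theta > 0,
\qquad p(x) = \int_{0}^{\infty} \mathcal{N}\!\bigl(x;\, 0,\, \theta\sigma^{2}\bigr)\, p(\theta)\,\mathrm{d}\theta .
```

Estimating the hidden multiplier jointly with the coefficients is what lets such a model adapt the unknown variances to non-stationary data.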
ASELMAR: Active and semi-supervised learning-based framework to reduce multi-labeling efforts for activity recognition
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104269
Aydin Saribudak, Sifan Yuan, Chenyang Gao, Waverly V. Gestrich-Thompson, Zachary P. Milestone, Randall S. Burd, Ivan Marsic
Manual annotation of unlabeled data for model training is expensive and time-consuming, especially for visual datasets requiring domain-specific experience for multi-labeling, such as video records generated in hospital settings. There is a need for frameworks that reduce human labeling effort while improving training performance. Semi-supervised learning is widely used to generate predictions for unlabeled samples in partially labeled datasets. Active learning can be combined with semi-supervised learning to annotate unlabeled samples and reduce the sampling bias introduced by label predictions. We developed the ASELMAR framework, based on active and semi-supervised learning techniques, to reduce the time and effort associated with multi-labeling of unlabeled samples for activity recognition. ASELMAR (i) categorizes the predictions for unlabeled data based on the confidence level of the predictions, using fixed and adaptive threshold settings, (ii) applies a label verification procedure to samples with ambiguous predictions, and (iii) retrains the model iteratively using samples with high-confidence predictions or manual annotations. We also designed a software tool to guide domain experts in verifying ambiguous predictions. We applied ASELMAR to recognize eight selected activities from our trauma resuscitation video dataset and evaluated its performance based on label verification time and the mean AP score metric. The label verification required by ASELMAR was 12.1% of the manual annotation effort for the unlabeled video records. The improvement in mean AP score was 5.7% for the first iteration and 8.3% for the second iteration with the fixed-threshold method compared to the baseline model. The p-values were below 0.05 for the target activities. Using an adaptive-threshold method, ASELMAR achieved a decrease in AP score deviation, implying improved model robustness. In a speech-based case study, the word error rate decreased by 6.2% and the average transcription factor increased 2.6 times, supporting the broad applicability of ASELMAR in reducing labeling effort from domain experts.
Citations: 0
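Step (i) above, routing each unlabeled sample by prediction confidence, can be pictured with a fixed-threshold sketch like the following; the threshold values and the function name are illustrative, not the paper's:

```python
def triage_predictions(probs, hi=0.9, lo=0.5):
    """Split multi-label predictions into confident pseudo-labels and
    ambiguous labels routed to manual verification, echoing a
    fixed-threshold scheme. probs: {sample_id: [p_label_0, p_label_1, ...]}."""
    confident, ambiguous = {}, {}
    for sid, p in probs.items():
        sure = [i for i, q in enumerate(p) if q >= hi]
        unsure = [i for i, q in enumerate(p) if lo <= q < hi]
        if sure:
            confident[sid] = sure     # reused as pseudo-labels when retraining
        if unsure:
            ambiguous[sid] = unsure   # sent to the label-verification tool
    return confident, ambiguous
```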
RelFormer: Advancing contextual relations for transformer-based dense captioning
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2025.104300
Weiqi Jin, Mengxue Qu, Caijuan Shi, Yao Zhao, Yunchao Wei
Dense captioning aims to detect regions in images and generate natural language descriptions for each identified region. For this task, contextual modeling is crucial for generating accurate descriptions, since regions in the image can interact with each other. Previous efforts primarily focused on modeling between categorized object regions extracted by pre-trained object detectors, e.g., Fast R-CNN. However, they overlook contextual modeling for non-object regions, e.g., sky, rivers, and grass, commonly referred to as “stuff”. In this paper, we propose the RelFormer framework to enhance the contextual relation modeling of Transformer-based dense captioning. Specifically, we design a CLIP-assisted region feature extraction module to extract rich contextual features of regions, including stuff regions. We then introduce a straightforward relation encoder based on self-attention to effectively model relationships between regional features. To accurately extract candidate regions in dense images while minimizing redundant proposals, we further introduce amplified-decay non-maximum suppression, which amplifies the decay of redundant proposals so that they can be removed while preserving the detection of small regions under a low confidence threshold. The experimental results indicate that, by enhancing contextual interactions, our model exhibits a good understanding of regions and attains state-of-the-art performance on dense captioning tasks. Our method achieves 17.52% mAP on VG V1.0, 16.59% on VG V1.2, and 15.49% on VG-COCO. Code is available at https://github.com/Wykay/Relformer.
Citations: 0
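The amplified-decay NMS is described only at a high level; the sketch below shows the soft-NMS family it extends, with a Gaussian decay raised to an amplification exponent `gamma` as a stand-in for the paper's actual decay rule:

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU of one (4,) box against an (M, 4) array; boxes as x1, y1, x2, y2."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def soft_nms_amplified(boxes, scores, sigma=0.5, gamma=2.0, thresh=0.05):
    """Soft-NMS with its Gaussian decay raised to gamma, so redundant
    overlapping proposals fade faster while low-overlap small regions
    survive a low confidence threshold."""
    scores = scores.copy()
    keep, idx = [], np.arange(len(scores))
    while idx.size > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size == 0:
            break
        ov = iou_one_to_many(boxes[best], boxes[idx])
        scores[idx] *= np.exp(-(ov ** 2) / sigma) ** gamma  # amplified decay
        idx = idx[scores[idx] > thresh]                     # prune fully decayed proposals
    return keep
```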
DM-Align: Leveraging the power of natural language instructions to make changes to images
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2025.104292
Maria-Mihaela Trusca, Tinne Tuytelaars, Marie-Francine Moens
Text-based semantic image editing assumes the manipulation of an image using a natural language instruction. Although recent works are capable of generating creative, high-quality images, the problem is still mostly approached as a black box prone to generating unexpected outputs. We therefore propose a novel model that enhances the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, together with the input image. The proposed Diffusion Masking with word Alignments (DM-Align) allows the editing of an image in a transparent and explainable way. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream. Compared to state-of-the-art baselines, quantitative and qualitative results show that DM-Align has superior performance in image editing conditioned on language instructions, preserves the background of the image well, and copes better with long text instructions.
Citations: 0
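The alignment step, deciding which words (and hence which image regions) to preserve versus change, can be pictured with a toy word-level diff; DM-Align's actual aligner is not reproduced here, and `align_words` is a hypothetical helper:

```python
from difflib import SequenceMatcher

def align_words(src_caption, instruction):
    """Toy word-level alignment between a source-image caption and an
    edit instruction. Words shared by both sentences mark content to
    preserve; words unique to the instruction mark content to generate."""
    src, tgt = src_caption.lower().split(), instruction.lower().split()
    keep, change = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=src, b=tgt).get_opcodes():
        if op == "equal":
            keep.extend(tgt[j1:j2])
        else:
            change.extend(tgt[j1:j2])
    return keep, change

# align_words("a black dog on the grass", "a white dog on the grass")
# -> keep ['a', 'dog', 'on', 'the', 'grass'], change ['white']
```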
Rebalanced supervised contrastive learning with prototypes for long-tailed visual recognition
IF 4.3, CAS Tier 3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2025.104291
Xuhui Chang, Junhai Zhai, Shaoxin Qiu, Zhengrong Sun
In the real world, data often follows a long-tailed distribution, so head classes receive more attention while tail classes are frequently overlooked. Although supervised contrastive learning (SCL) performs well on balanced datasets, it struggles to distinguish the features of tail classes in the latent space when dealing with long-tailed data. To address this issue, we propose Rebalanced Supervised Contrastive Learning (ReCL), which effectively enhances the separability of tail-class features. Compared with two state-of-the-art methods, Contrastive Learning based hybrid networks (Hybrid-SC) and Targeted Supervised Contrastive Learning (TSC), ReCL has two distinctive characteristics: (1) it sharpens the classification boundaries between tail classes by encouraging samples to align more closely with their corresponding prototypes, and (2) it does not require target generation, thereby conserving computational resources. Our method significantly improves the recognition of tail classes, demonstrating competitive accuracy across multiple long-tailed datasets. Our code has been uploaded to https://github.com/cxh981110/ReCL.
Citations: 0
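The prototype-alignment idea in point (1) can be sketched as a cross-entropy over feature-prototype similarities; the loss form, the temperature, and the absence of rebalancing weights below are assumptions rather than the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(feats, labels, prototypes, tau=0.1):
    """Pull each sample toward its class prototype via a softmax over
    feature-prototype cosine similarities. feats: (B, D) L2-normalized
    features; prototypes: (K, D) L2-normalized class prototypes;
    labels: (B,) class indices."""
    logits = feats @ prototypes.t() / tau   # (B, K) scaled similarities
    return F.cross_entropy(logits, labels)

# usage: feats = F.normalize(encoder(x), dim=1)
#        loss = prototype_alignment_loss(feats, y, F.normalize(protos, dim=1))
```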