Computer Vision and Image Understanding: Latest Articles

Semantic scene understanding through advanced object context analysis in image
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 252, Article 104299. Pub Date: 2025-02-01. DOI: 10.1016/j.cviu.2025.104299
Luis Hernando Ríos González, Sebastián López Flórez, Alfonso González-Briones, Fernando de la Prieta
Abstract: Advancements in computer vision have primarily concentrated on interpreting visual data, often overlooking the significance of contextual differences across regions within images. In contrast, our research introduces a model for indoor scene recognition that pivots towards the 'attention' paradigm. The model views attention as a response to stimulus image properties, suggesting that focus is 'pulled' towards the most visually salient zones within an image, as represented in a saliency map. Attention is directed towards these zones based on uninterpreted semantic features of the image, such as luminance contrast, color, shape, and edge orientation. This neurobiologically plausible and computationally tractable approach offers a more nuanced understanding of scenes by prioritizing zones solely on the basis of their image properties. The proposed model enhances scene understanding through an in-depth analysis of object context in images. Scene recognition is achieved by extracting features from selected regions of interest within individual image frames using patch-based object detection, generating distinctive feature descriptors for the identified objects of interest. These descriptors are then subjected to semantic embedding, which uses distributed representations to transform the sparse feature vectors into dense semantic vectors within a learned latent space, enabling subsequent classification by machine learning models trained on the embedded semantic representations. The model was evaluated on three image datasets: UIUC Sports-8, PASCAL VOC (Visual Object Classes), and a proprietary image set created by the authors. Compared with state-of-the-art methods, our approach offers more robust abstraction and generalization of interior scenes and demonstrates superior accuracy, improving scene classification in the selected indoor environments. Our code is published at https://github.com/sebastianlop8/Semantic-Scene-Object-Context-Analysis.git
Citations: 0
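The pipeline sketched in the abstract (sparse region descriptors embedded into a dense latent space, then classified) can be illustrated with a minimal sketch. The module name, dimensions, and mean-pooled aggregation below are illustrative assumptions, not code from the authors' repository.

```python
# Minimal sketch of the descriptor -> dense semantic embedding -> scene classifier idea.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class SemanticEmbeddingClassifier(nn.Module):
    def __init__(self, descriptor_dim=1024, embed_dim=256, num_scenes=8):
        super().__init__()
        # Distributed representation: map sparse patch/object descriptors
        # into a dense latent space learned jointly with the classifier.
        self.embed = nn.Sequential(
            nn.Linear(descriptor_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_scenes)

    def forward(self, descriptors):
        # descriptors: (batch, num_regions, descriptor_dim) sparse feature vectors
        dense = self.embed(descriptors)          # (batch, num_regions, embed_dim)
        scene_vec = dense.mean(dim=1)            # aggregate region embeddings per image
        return self.classifier(scene_vec)        # scene logits

if __name__ == "__main__":
    model = SemanticEmbeddingClassifier()
    x = torch.zeros(2, 12, 1024)                 # 2 images, 12 salient regions each
    x[:, :, ::97] = 1.0                          # mimic sparsity of the descriptors
    print(model(x).shape)                        # torch.Size([2, 8])
```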
Adaptive semantic guidance network for video captioning
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 251, Article 104255. Pub Date: 2025-02-01. DOI: 10.1016/j.cviu.2024.104255
Yuanyuan Liu, Hong Zhu, Zhong Wu, Sen Du, Shuning Wu, Jing Shi
Abstract: Video captioning aims to describe video content in natural language, and effectively integrating visual and textual information is crucial for generating accurate captions. However, we find that existing methods over-rely on language-prior information acquired during training, so the model tends to output high-frequency fixed phrases. To solve this problem, we extract high-quality semantic information from the multi-modal input and build a semantic guidance mechanism that adapts the contributions of visual and textual semantics when generating captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information in the visual and textual modalities. The ACD dynamically adjusts the contribution weights of visual and textual semantics for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome over-reliance on language priors, resulting in more accurate video captions. Extensive experiments on commonly used video captioning datasets show that ASGNet reaches the state of the art on MSVD and MSR-VTT and also performs well on YouCookII, fully verifying the advantages of our method.
Citations: 0
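The ACD's core idea, dynamically weighting visual against textual semantics before word prediction, might be realized with a learned gate along these lines. The module name, feature sizes, and the single-scalar gate are assumptions for illustration, not the ASGNet implementation.

```python
# Illustrative sketch of an adaptive semantic gate: a learned scalar re-weights visual
# vs. textual semantic features before word prediction. Not the ASGNet code.
import torch
import torch.nn as nn

class AdaptiveSemanticGate(nn.Module):
    def __init__(self, dim=512, vocab_size=10000):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.word_head = nn.Linear(dim, vocab_size)

    def forward(self, visual_sem, text_sem):
        # visual_sem, text_sem: (batch, dim) semantic features for the current decoding step
        w = self.gate(torch.cat([visual_sem, text_sem], dim=-1))  # (batch, 1) in [0, 1]
        fused = w * visual_sem + (1.0 - w) * text_sem             # adaptive contribution
        return self.word_head(fused)                              # logits over the vocabulary

if __name__ == "__main__":
    gate = AdaptiveSemanticGate()
    v, t = torch.randn(4, 512), torch.randn(4, 512)
    print(gate(v, t).shape)  # torch.Size([4, 10000])
```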
Self-supervised vision transformers for semantic segmentation
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 251, Article 104272. Pub Date: 2025-02-01. DOI: 10.1016/j.cviu.2024.104272
Xianfan Gu, Yingdong Hu, Chuan Wen, Yang Gao
Abstract: Semantic segmentation is a fundamental task in computer vision and a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need for pixel-level fine-grained understanding; (ii) the assistance of global context understanding; and (iii) achieving both of the above with a dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture, MEVT, for generating high-resolution features with global contextual information propagation, and a self-supervised training strategy for learning fine-grained semantic features. In our study, the framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on the COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets; e.g., MEVT gains +1.3 mIoU in linear probing on PASCAL VOC.
Citations: 0
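Linear probing, the evaluation protocol cited for the +1.3 mIoU gain, freezes the pre-trained encoder and trains only a linear (1x1 convolution) head for per-pixel classification. A minimal sketch follows; the toy encoder merely stands in for MEVT, whose architecture is not reproduced here.

```python
# Minimal sketch of segmentation linear probing: encoder frozen, only a 1x1 conv trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):                     # placeholder for a pre-trained backbone
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, out_dim, kernel_size=3, stride=4, padding=1)

    def forward(self, x):
        return self.conv(x)                      # (B, out_dim, H/4, W/4) dense features

class LinearProbe(nn.Module):
    def __init__(self, encoder, feat_dim=64, num_classes=21):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze the self-supervised features
            p.requires_grad = False
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)  # linear head

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)
        logits = self.head(feats)
        # upsample logits back to input resolution for per-pixel supervision
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

if __name__ == "__main__":
    probe = LinearProbe(ToyEncoder())
    img = torch.randn(2, 3, 64, 64)
    target = torch.randint(0, 21, (2, 64, 64))
    loss = F.cross_entropy(probe(img), target)
    print(loss.item())
```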
RS3Lip: Consistency for remote sensing image classification on part embeddings using self-supervised learning and CLIP
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 251, Article 104254. Pub Date: 2025-02-01. DOI: 10.1016/j.cviu.2024.104254
Ankit Jha, Mainak Singha, Avigyan Bhattacharya, Biplab Banerjee
Abstract: Tackling domain and class generalization challenges remains a significant hurdle in the realm of remote sensing (RS). Recently, large-scale pre-trained vision-language models (VLMs), exemplified by CLIP, have showcased impressive zero-shot and few-shot generalization capabilities through extensive contrastive training. Existing literature emphasizes prompt learning as a means of enriching prompts with both domain and content information, particularly through smaller learnable projectors, thereby addressing multi-domain data challenges. Along with this, it is observed that CLIP's vision encoder fails to generalize well on puzzled or corrupted RS images. In response, we propose a novel solution that uses self-supervised learning (SSL) to ensure consistency for puzzled RS images in domain generalization (DG). This approach strengthens visual features, facilitating the generation of domain-invariant prompts. Our proposed RS3Lip, trained with small projectors featuring few layers, complements the pre-trained CLIP. It incorporates SSL and inpainting losses for visual features, along with a consistency loss between the features of the SSL tasks and the textual features. Empirical findings demonstrate that RS3Lip consistently outperforms state-of-the-art prompt learning methods across five benchmark optical remote sensing datasets, achieving improvements of at least 1.3% in domain and class generalization tasks.
Citations: 0
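The consistency idea, pulling together the features of a "puzzled" (patch-shuffled) image and the corresponding text features through small projectors, can be sketched as below. The frozen CLIP encoders are replaced with random placeholders, and the patch-shuffling routine, projector design, and cosine consistency loss are assumptions, not the RS3Lip code.

```python
# Hedged sketch: consistency between SSL (puzzled-image) features and text features
# through small projectors. Placeholders stand in for frozen CLIP encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def puzzle(images, grid=2):
    # Shuffle a grid of patches to build the self-supervised "puzzled" view.
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)       # (b, c, g, g, ph, pw)
    patches = patches.reshape(b, c, grid * grid, ph, pw)
    patches = patches[:, :, torch.randperm(grid * grid)]        # shuffle patch order
    patches = patches.reshape(b, c, grid, grid, ph, pw)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)

class SmallProjector(nn.Module):                                # few-layer projector
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

def consistency_loss(image_feats, text_feats, proj_img, proj_txt):
    zi = F.normalize(proj_img(image_feats), dim=-1)
    zt = F.normalize(proj_txt(text_feats), dim=-1)
    return (1.0 - (zi * zt).sum(dim=-1)).mean()                 # cosine consistency

if __name__ == "__main__":
    proj_img, proj_txt = SmallProjector(), SmallProjector()
    puzzled = puzzle(torch.randn(4, 3, 224, 224))               # SSL view of the image
    img_feats = torch.randn(4, 512)   # stands in for frozen CLIP features of `puzzled`
    txt_feats = torch.randn(4, 512)   # stands in for frozen CLIP text features
    print(consistency_loss(img_feats, txt_feats, proj_img, proj_txt).item())
```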
Adversarial Semi-supervised domain adaptation for semantic segmentation: A new role for labeled target samples
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 253, Article 104305. Pub Date: 2025-01-30. DOI: 10.1016/j.cviu.2025.104305
Marwa Kechaou, Mokhtar Z. Alaya, Romain Hérault, Gilles Gasso
Abstract: Adversarial learning baselines for domain adaptation (DA) in semantic segmentation are under-explored in the semi-supervised setting. These baselines involve the available labeled target samples only in the supervision loss. In this work, we propose to enhance their usefulness for both the semantic segmentation network and the single domain-classifier network. We design new training objective losses for cases where labeled target data behave as source samples or as real target samples. The underlying rationale is that treating the set of labeled target samples as part of the source domain helps reduce the domain discrepancy and, hence, improves the contribution of the adversarial loss. To support our approach, we consider a complementary method that mixes source and labeled target data and then applies the same adaptation process. We further propose an unsupervised, entropy-based selection procedure to optimize the choice of labeled target samples for adaptation. We illustrate our findings through extensive experiments on the GTA5, SYNTHIA, and Cityscapes benchmarks. The empirical evaluation highlights the competitive performance of our approach.
Citations: 0
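The entropy-based selection of labeled target samples could be computed along the following lines: rank candidate images by the mean entropy of their predicted segmentation maps and keep the top-k. This is an assumed formulation of such a criterion, not the authors' exact procedure.

```python
# Sketch of an entropy-based sample selection step (assumed formulation).
import torch
import torch.nn.functional as F

def mean_prediction_entropy(logits):
    # logits: (batch, num_classes, H, W) from the segmentation network
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)  # (batch, H, W)
    return entropy.mean(dim=(1, 2))                                    # (batch,)

def select_samples(logits, k=2, lowest=True):
    # Rank candidates by mean prediction entropy and return the k best indices.
    scores = mean_prediction_entropy(logits)
    order = torch.argsort(scores, descending=not lowest)
    return order[:k], scores

if __name__ == "__main__":
    fake_logits = torch.randn(8, 19, 32, 64)        # 8 candidate images, 19 classes
    idx, scores = select_samples(fake_logits, k=3)
    print(idx.tolist(), scores[idx].tolist())
```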
Mask prior generation with language queries guided networks for referring image segmentation
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 253, Article 104296. Pub Date: 2025-01-29. DOI: 10.1016/j.cviu.2025.104296
Jinhao Zhou, Guoqiang Xiao, Michael S. Lew, Song Wu
Abstract: The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask that accurately segments the target object according to its natural language expression. Previous RIS methods neglect the significant language information in both the encoder and decoder stages and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate localization of the target object. This paper therefore proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) realizes bidirectional fusion of vision and language embeddings, generating additional language queries that capture both the location of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) enhances the exploration of the significance of the cross-modal embeddings. In the decoder, the Language Query Guided Mask Prior Generator (LQPG) utilizes the generated language queries to activate significant information in the upsampled decoding features, obtaining a more accurate mask prior that guides the final prediction. Extensive experiments on the RefCOCO series of datasets show that our method consistently improves over state-of-the-art methods. The source code of MPG-LQGNet is available at https://github.com/SWU-CS-MediaLab/MPG-LQGNet.
Citations: 0
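One way the generated language queries could activate upsampled decoding features into a mask prior is a dot-product affinity between projected queries and per-pixel features, as sketched below. Shapes, the max-over-queries reduction, and the module name are assumptions, not the MPG-LQGNet implementation.

```python
# Illustrative sketch: language queries attend to decoder features to form a coarse
# mask prior that gates the features. Not the MPG-LQGNet code.
import torch
import torch.nn as nn

class LanguageQueryMaskPrior(nn.Module):
    def __init__(self, feat_dim=256, query_dim=256):
        super().__init__()
        self.proj_q = nn.Linear(query_dim, feat_dim)

    def forward(self, dec_feats, lang_queries):
        # dec_feats: (B, C, H, W) upsampled decoding features
        # lang_queries: (B, Q, query_dim) queries built from the referring expression
        b, c, h, w = dec_feats.shape
        q = self.proj_q(lang_queries)                           # (B, Q, C)
        pixels = dec_feats.flatten(2)                           # (B, C, H*W)
        affinity = torch.bmm(q, pixels) / (c ** 0.5)            # (B, Q, H*W)
        prior = affinity.max(dim=1).values.view(b, 1, h, w)     # strongest query response
        activated = dec_feats * torch.sigmoid(prior)            # activate salient pixels
        return prior, activated

if __name__ == "__main__":
    m = LanguageQueryMaskPrior()
    prior, feats = m(torch.randn(2, 256, 40, 40), torch.randn(2, 4, 256))
    print(prior.shape, feats.shape)  # torch.Size([2, 1, 40, 40]) torch.Size([2, 256, 40, 40])
```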
Luminance prior guided Low-Light 4C catenary image enhancement
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 253, Article 104287. Pub Date: 2025-01-27. DOI: 10.1016/j.cviu.2025.104287
Zhenhua Xue, Jun Luo, Zhenlin Wei
Abstract: In scenarios with inadequate fill lighting, catenary images captured by railway power-supply 4C monitoring equipment often exhibit low light, which poses significant challenges for accurately detecting equipment anomalies and, in turn, affects the smooth operation, timely maintenance, and overall safety assurance of railway systems. Recognizing this critical issue, our study introduces a dual-branch, prior-guided enhancement method tailored to low-light catenary images obtained through 4C monitoring equipment. In the multi-scale branch, we leverage convolutional neural networks (CNNs) together with the self-attention mechanism to extract both local and global features, allowing the model to capture intricate details and broader contextual information. Concurrently, the pixel-wise branch estimates enhancement parameters at the pixel level, enabling an adaptive and iterative enhancement process in which each pixel is optimized based on its own characteristics and context, leading to more nuanced and accurate enhancements. To further inform and constrain the enhancement process, we conduct a statistical analysis of the average light intensity of images under both normal and low-light conditions; the differences and correlations between image brightness under these conditions yield statistical priors that are integrated into our method and guide its decisions during enhancement. Moreover, to mitigate the challenge of obtaining labeled data, we adopt an unsupervised training strategy, which allows the method to learn without extensive and costly labeling efforts and makes it more practical and scalable for real-world applications. Experimental results demonstrate the superiority of the proposed method over state-of-the-art approaches for low-light catenary image enhancement. Our method improves the visual quality of the images, ultimately contributing to the safety and efficiency of railway operations.
Citations: 0
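The pixel-wise branch's adaptive, iterative enhancement can be illustrated with a per-pixel curve adjustment applied repeatedly, a common low-light formulation used here purely as an assumption; the paper's actual parameterization and its luminance priors are not reproduced.

```python
# Sketch of a pixel-wise, iterative enhancement step: a per-pixel parameter map alpha
# is applied repeatedly through a quadratic curve (assumed formulation).
import torch

def iterative_curve_enhance(image, alpha, steps=4):
    # image: (B, 3, H, W) in [0, 1]; alpha: (B, 3, H, W) per-pixel parameters in [-1, 1]
    x = image
    for _ in range(steps):
        x = x + alpha * (x - x * x)   # pushes dark pixels up while keeping values in [0, 1]
    return x.clamp(0.0, 1.0)

if __name__ == "__main__":
    low_light = torch.rand(1, 3, 64, 64) * 0.2           # simulate an under-exposed image
    alpha = torch.full_like(low_light, 0.8)               # would be predicted per pixel
    enhanced = iterative_curve_enhance(low_light, alpha)
    print(low_light.mean().item(), enhanced.mean().item())  # mean brightness increases
```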
Large-scale Riemannian meta-optimization via subspace adaptation
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 253, Article 104306. Pub Date: 2025-01-27. DOI: 10.1016/j.cviu.2025.104306
Peilin Yu, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Yunde Jia
Abstract: Riemannian meta-optimization provides a promising approach to solving non-linear constrained optimization problems by training neural networks as optimizers that perform optimization on Riemannian manifolds. However, existing Riemannian meta-optimization methods take up huge memory footprints in large-scale optimization settings, because the learned optimizer can only adapt gradients of a fixed size and thus cannot be shared across different Riemannian parameters. In this paper, we propose an efficient Riemannian meta-optimization method that significantly reduces the memory burden for large-scale optimization via a subspace adaptation scheme. Our method trains neural networks to individually adapt the row and column subspaces of Riemannian gradients, instead of directly adapting the full gradient matrices as in existing Riemannian meta-optimization methods. In this case, the learned optimizer can be shared across Riemannian parameters of different sizes. Our method reduces the model memory consumption by six orders of magnitude when optimizing an orthogonally constrained mainstream deep neural network (e.g., ResNet50). Experiments on multiple Riemannian tasks show that our method not only reduces memory consumption but also improves the performance of Riemannian meta-optimization.
Citations: 0
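The subspace-adaptation idea, adapting row and column subspaces so that one learned optimizer serves matrices of any size, can be illustrated conceptually as below. The per-row and per-column statistics, the tiny MLPs, and the multiplicative rescaling are assumptions chosen to show shape-agnosticism; the paper's optimizer architecture and its Riemannian retraction step are not reproduced.

```python
# Conceptual sketch: two tiny networks adapt row and column subspaces of a gradient
# matrix, so the same learned optimizer applies to parameters of any size.
import torch
import torch.nn as nn

class SubspaceAdapter(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        # Operates on per-row / per-column statistics, so it is shape-agnostic.
        self.row_net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.col_net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, grad):
        # grad: (m, n) gradient of a matrix parameter
        row_stat = grad.norm(dim=1, keepdim=True)              # (m, 1) row subspace statistic
        col_stat = grad.norm(dim=0, keepdim=True).t()          # (n, 1) column subspace statistic
        row_scale = torch.sigmoid(self.row_net(row_stat))      # (m, 1)
        col_scale = torch.sigmoid(self.col_net(col_stat)).t()  # (1, n)
        return row_scale * grad * col_scale                    # adapted update, same shape as grad

if __name__ == "__main__":
    adapter = SubspaceAdapter()                 # one optimizer shared across parameter sizes
    for shape in [(64, 32), (512, 2048)]:
        g = torch.randn(*shape)
        print(shape, adapter(g).shape)
```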
Beyond shadows and light: Odyssey of face recognition for social good
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 253, Article 104293. Pub Date: 2025-01-24. DOI: 10.1016/j.cviu.2025.104293
Chiranjeev Chiranjeev, Muskan Dosi, Shivang Agarwal, Jyoti Chaudhary, Pranav Pant, Mayank Vatsa, Richa Singh
Abstract: Face recognition technology, though undeniably transformative in its technical evolution, remains conspicuously underleveraged in humanitarian endeavors. This survey highlights its latent utility in addressing critical societal needs, ranging from the expeditious identification of disaster-afflicted individuals to locating missing children. We investigate technical complexities arising from facial-feature degradation, aging, occlusions, and low-resolution images, issues frequently encountered in real-world scenarios. We provide a comprehensive review of state-of-the-art models and relevant datasets, including a meta-analysis of existing and curated collections such as the newly introduced Web and Generated Injured Faces (WGIF) dataset. Our evaluation covers the performance of current face recognition algorithms in real-world scenarios, exemplified by a case study on the Balasore train accident in India. By examining factors such as the impact of aging on facial features and the limitations of traditional models in handling low-quality or occluded images, we showcase the complexities inherent in applying face recognition for societal good. We discuss future research directions, emphasizing the need for interdisciplinary collaborations and innovative methodologies to enhance the adaptability and robustness of face recognition systems in humanitarian contexts. Through detailed case studies, we provide insights into the effectiveness of current methods and identify key areas for improvement. Our goal is to encourage the development of specialized face recognition models for social-welfare applications, contributing to timely and accurate identification in critical situations.
Citations: 0
UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation
IF 4.3 | CAS Zone 3, Computer Science
Computer Vision and Image Understanding, Volume 251, Article 104246. Pub Date: 2024-11-20. DOI: 10.1016/j.cviu.2024.104246
Haibo Chen, Lei Zhao
Abstract: Existing style transfer methods usually rely on style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to convey information, so a better choice is to use text descriptions instead of style images to represent the target style. To this end, we propose a novel Unpaired Arbitrary Text-guided Style Transfer (UATST) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with a single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and a pre-trained CLIP text encoder to map the text description into the CLIP feature space. We then introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information from the two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.
Citations: 0
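A cross-space modulation step could bridge the two spaces by projecting the CLIP text embedding to per-channel scale and shift parameters that modulate instance-normalized VGG content features, as sketched below. This AdaIN-style formulation is an assumption for illustration, not the actual UATST module.

```python
# Sketch of cross-space modulation: a text embedding (CLIP space) is projected to
# per-channel scale/shift that modulate content features (VGG space). Assumed design.
import torch
import torch.nn as nn

class CrossSpaceModulation(nn.Module):
    def __init__(self, text_dim=512, feat_channels=512):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, feat_channels)
        self.to_shift = nn.Linear(text_dim, feat_channels)

    def forward(self, content_feats, text_embed):
        # content_feats: (B, C, H, W) VGG features; text_embed: (B, text_dim) CLIP features
        b, c, _, _ = content_feats.shape
        # Instance-normalize the content, then inject the style described by the text.
        mu = content_feats.mean(dim=(2, 3), keepdim=True)
        sigma = content_feats.std(dim=(2, 3), keepdim=True) + 1e-6
        normed = (content_feats - mu) / sigma
        scale = self.to_scale(text_embed).view(b, c, 1, 1)
        shift = self.to_shift(text_embed).view(b, c, 1, 1)
        return (1.0 + scale) * normed + shift

if __name__ == "__main__":
    mod = CrossSpaceModulation()
    stylized = mod(torch.randn(2, 512, 32, 32), torch.randn(2, 512))
    print(stylized.shape)  # torch.Size([2, 512, 32, 32])
```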