International Journal of Computer Vision: Latest Publications

Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-18 DOI: 10.1007/s11263-025-02396-5
Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger
{"title":"Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data","authors":"Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger","doi":"10.1007/s11263-025-02396-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02396-5","url":null,"abstract":"<p>Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between V and I modalities, especially under real-world conditions, where images face corruptions such as blur, noise, and weather. Despite their practical relevance, deep learning models for multimodal V-I ReID remain far less investigated than for single and cross-modal V to I settings. Moreover, state-of-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID – named Multimodal Middle Stream Fusion (MMSF) – that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing for dynamic balancing of the importance of each modality. The literature typically reports ReID performance using clean datasets, but more recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios, using data with realistic corruptions. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate the multimodal V-I ReID models that are more likely to perform well in real-world operational conditions. In particular, the proposed ML-MDA is shown as essential for a V-I person ReID system to sustain high accuracy and robustness in face of corrupted multimodal images. Our multimodal ReID models attains the best accuracy and complexity trade-off under both CL and NCL settings and compared to state-of-art unimodal ReID systems, except for the ThermalWORLD dataset due to its low-quality I. Our MMSF model outperforms every method under CL and NCL camera scenarios. GitHub code: https://github.com/art2611/MREiD-UCD-CCD.git.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"183 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-18 DOI: 10.1007/s11263-025-02405-7
Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang
{"title":"A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements","authors":"Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang","doi":"10.1007/s11263-025-02405-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02405-7","url":null,"abstract":"<p>For the pedestrian attributes recognition, we demonstrate that deep models can memorize the pattern of attributes co-occurrences inherent to dataset, whether through explicit or implicit means. However, since the attributes interdependency is highly variable and unpredictable across different scenarios, the modeled attributes co-occurrences de facto serve as a data selection bias that hardly generalizes onto out-of-distribution samples. To address this thorny issue, we formulate a novel concept of attributes-disentangled feature learning, by which the mutual information among features of different attributes is minimized, ensuring the recognition of an attribute independent to the presence of others. Stemming from it, practical approaches are developed to effectively decouple attributes by suppressing the shared feature factors among attributes-specific features. As compelling merits, our method is exercised with minimal test-time computation, and is also highly extendable. With slight modifications on it, further improvements regarding better exploration of the feature space, softening the issue of imbalanced attributes distribution in dataset and flexibility in term of preserving certain causal attributes interdependencies can be achieved. Comprehensive experiments on various realistic datasets, such as PA100k, PETAzs and RAPzs, validate the efficacy and a spectrum of superiorities of our method.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"70 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143653345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-17 DOI: 10.1007/s11263-025-02409-3
Zeyu Wang, Libo Zhao, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang
{"title":"Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model","authors":"Zeyu Wang, Libo Zhao, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang","doi":"10.1007/s11263-025-02409-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02409-3","url":null,"abstract":"<p>Multi-modality image fusion aims to extract complementary features from multiple source images of different modalities, generating a fused image that inherits their advantages. To address challenges in cross-modality shared feature (CMSF) extraction, single-modality specific feature (SMSF) fusion, and the absence of ground truth (GT) images, we propose MTG-Fusion, a multi-text guided model. We leverage the capabilities of large vision-language models to generate text descriptions tailored to the input images, providing novel insights for these challenges. Our model introduces a text-guided CMSF extractor (TGCE) and a text-guided SMSF fusion module (TGSF). TGCE transforms visual features into the text domain using manifold-isometric domain transform techniques and provides effective visual-text interaction based on text-vision and text-text distances. TGSF fuses each dimension of visual features with corresponding text features, creating a weight matrix utilized for SMSF fusion. We also incorporate the constructed textual GT into the loss function for collaborative training. Extensive experiments demonstrate that MTG-Fusion achieves state-of-the-art performance on infrared and visible image fusion and medical image fusion tasks. The code is available at: https://github.com/zhaolb4080/MTG-Fusion.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Not All Pixels are Equal: Learning Pixel Hardness for Semantic Segmentation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-17 DOI: 10.1007/s11263-025-02416-4
Xin Xiao, Daiguo Zhou, Jiagao Hu, Yi Hu, Yongchao Xu
{"title":"Not All Pixels are Equal: Learning Pixel Hardness for Semantic Segmentation","authors":"Xin Xiao, Daiguo Zhou, Jiagao Hu, Yi Hu, Yongchao Xu","doi":"10.1007/s11263-025-02416-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02416-4","url":null,"abstract":"<p>Semantic segmentation has witnessed great progress. Despite the impressive overall results, the segmentation performance in some hard areas (<i>e.g.</i>, small objects or thin parts) is still not promising. A straightforward solution is hard sample mining. Yet, most existing hard pixel mining strategies for semantic segmentation often rely on pixel’s loss value, which tends to decrease during training. Intuitively, the pixel hardness for segmentation mainly depends on image structure and is expected to be stable. In this paper, we propose to learn pixel hardness for semantic segmentation by leveraging hardness information contained in global and historical loss values. More precisely, we add a gradient-independent branch for learning a hardness level (HL) map by maximizing hardness-weighted segmentation loss, which is minimized for the segmentation head. This encourages large hardness values in difficult areas, leading to appropriate and stable HL map. Despite its simplicity, the proposed method can be applied to most segmentation methods with no and marginal extra cost during inference and training, respectively. Without bells and whistles, the proposed method achieves consistent improvement (1.37% mIoU on average) over most popular semantic segmentation methods on the Cityscapes dataset, and demonstrates good generalization ability across domains. The source codes are available at this link.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"69 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Source Domain Adaptation by Causal-Guided Adaptive Multimodal Diffusion Networks
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-15 DOI: 10.1007/s11263-025-02401-x
Ziyun Cai, Yawen Huang, Tengfei Zhang, Yefeng Zheng, Dong Yue
{"title":"Multi-Source Domain Adaptation by Causal-Guided Adaptive Multimodal Diffusion Networks","authors":"Ziyun Cai, Yawen Huang, Tengfei Zhang, Yefeng Zheng, Dong Yue","doi":"10.1007/s11263-025-02401-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02401-x","url":null,"abstract":"<p>Multi-source domain adaptation (MSDA) strives to adapt the models trained on multimodal labelled source domains to an unlabelled target domain. Recent GANs based MSDA methods implicitly characterize the image distribution, which may result in limited sample fidelity, causing misalignment of pixel-level information among sources and the target. Furthermore, when samples from different sources interfere during the learning process, significant misalignment across different source domains may arise. In this paper, we propose a novel MSDA framework, called Causal-guided Adaptive Multimodal Diffusion Networks (C-AMDN), to tackle these challenges. C-AMDN incorporates a diffusive adversarial generation model for high-fidelity, efficient adaptation among source and target domains, along with deep causal inference re-weighting mechanism for the decision-making process that the conditional distributions of outcomes remain consistent across different domains, even as the input distributions change. In addition, we propose an efficient way to further adapt the input image to another domain: we preserve important semantic information by a density constraint regularization in the generation model. Experimental results demonstrate that C-AMDN significantly outperforms existing methods across several real-world domain adaptation benchmarks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"89 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143627758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Expressive Image Generation and Editing with Rich Text
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-14 DOI: 10.1007/s11263-025-02361-2
Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang
{"title":"Expressive Image Generation and Editing with Rich Text","authors":"Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang","doi":"10.1007/s11263-025-02361-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02361-2","url":null,"abstract":"<p>Plain text has become a prevalent interface for text-based image synthesis and editing. Its limited customization options, however, hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. Furthermore, describing a reference concept or texture in plain text is non-trivial. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, texture fill, footnote, and embedded image. We extract each word’s attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis with reference concepts or texture. We achieve these capabilities through a region-based diffusion process. We first obtain each word’s mask that characterizes the region guided by the word. For each region, we enforce its text attributes by creating customized prompts, applying guidance within the region, and maintaining its fidelity against plain-text generations or input images through region-based injections. We present various examples of image generation and editing from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"60 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02398-3
Haoran Duan, Shuai Shao, Bing Zhai, Tejal Shah, Jungong Han, Rajiv Ranjan
{"title":"Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation","authors":"Haoran Duan, Shuai Shao, Bing Zhai, Tejal Shah, Jungong Han, Rajiv Ranjan","doi":"10.1007/s11263-025-02398-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02398-3","url":null,"abstract":"<p>The rapid development of multimodal generative vision models has drawn scientific curiosity. Notable advancements, such as OpenAI’s ChatGPT and Stable Diffusion, demonstrate the potential of combining multimodal data for generative content. Nonetheless, customising these models to specific domains or tasks is challenging due to computational costs and data requirements. Conventional fine-tuning methods take redundant processing resources, motivating the development of parameter-efficient fine-tuning technologies such as adapter module, low-rank factorization and orthogonal fine-tuning. These solutions selectively change a subset of model parameters, reducing learning needs while maintaining high-quality results. Orthogonal fine-tuning, regarded as a reliable technique, preserves semantic linkages in weight space but has limitations in its expressive powers. To better overcome these constraints, we provide a simple but innovative and effective transformation method inspired by Möbius geometry, which replaces conventional orthogonal transformations in parameter-efficient fine-tuning. This strategy improved fine-tuning’s adaptability and expressiveness, allowing it to capture more data patterns. Our strategy, which is supported by theoretical understanding and empirical validation, outperforms existing approaches, demonstrating competitive improvements in generation quality for key generative tasks.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"16 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exemplar-Free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02374-x
Marco Cotogni, Fei Yang, Claudio Cusano, Andrew D. Bagdanov, Joost van de Weijer
{"title":"Exemplar-Free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation","authors":"Marco Cotogni, Fei Yang, Claudio Cusano, Andrew D. Bagdanov, Joost van de Weijer","doi":"10.1007/s11263-025-02374-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02374-x","url":null,"abstract":"<p>Vision transformers (ViTs) have achieved remarkable successes across a broad range of computer vision applications. As a consequence, there has been increasing interest in extending continual learning theory and techniques to ViT architectures. We propose a new method for exemplar-free class incremental training of ViTs. The main challenge of exemplar-free continual learning is maintaining plasticity of the learner without causing catastrophic forgetting of previously learned tasks. This is often achieved via exemplar replay which can help recalibrate previous task classifiers to the feature drift which occurs when learning new tasks. Exemplar replay, however, comes at the cost of retaining samples from previous tasks which for many applications may not be possible. To address the problem of continual ViT training, we first propose <i>gated class-attention</i> to minimize the drift in the final ViT transformer block. This mask-based gating is applied to class-attention mechanism of the last transformer block and strongly regulates the weights crucial for previous tasks. Importantly, gated class-attention does not require the task-ID during inference, which distinguishes it from other parameter isolation methods. Secondly, we propose a new method of <i>feature drift compensation</i> that accommodates feature drift in the backbone when learning new tasks. The combination of gated class-attention and cascaded feature drift compensation allows for plasticity towards new tasks while limiting forgetting of previous ones. Extensive experiments performed on CIFAR-100, Tiny-ImageNet and ImageNet100 demonstrate that our exemplar-free method obtains competitive results when compared to rehearsal based ViT methods.(Code:https://github.com/OcraM17/GCAB-CFDC)</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143608063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Attribute-Centric Compositional Text-to-Image Generation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02371-0
Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang
{"title":"Attribute-Centric Compositional Text-to-Image Generation","authors":"Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang","doi":"10.1007/s11263-025-02371-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02371-0","url":null,"abstract":"<p>Despite the recent impressive breakthroughs in text-to-image generation, generative models have difficulty in capturing the data distribution of underrepresented attribute compositions while over-memorizing overrepresented attribute compositions, which raises public concerns about their robustness and fairness. To tackle this challenge, we propose <b>ACTIG</b>, an attribute-centric compositional text-to-image generation framework. We present an attribute-centric feature augmentation and a novel image-free training scheme, which greatly improves model’s ability to generate images with underrepresented attributes. We further propose an attribute-centric contrastive loss to avoid overfitting to overrepresented attribute compositions. We validate our framework on the CelebA-HQ and CUB datasets. Extensive experiments show that the compositional generalization of ACTIG is outstanding, and our framework outperforms previous works in terms of image quality and text-image consistency. The source code and trained models are publicly available at https://github.com/yrcong/ACTIG.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-11 DOI: 10.1007/s11263-025-02395-6
Chao Xu, Yijie Qian, Shaoting Zhu, Baigui Sun, Jian Zhao, Yong Liu, Xuelong Li
{"title":"UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors","authors":"Chao Xu, Yijie Qian, Shaoting Zhu, Baigui Sun, Jian Zhao, Yong Liu, Xuelong Li","doi":"10.1007/s11263-025-02395-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02395-6","url":null,"abstract":"<p>Face reenactment and swapping share a similar pattern of identity and attribute manipulation. Our previous work UniFace has preliminarily explored establishing a unification between the two at the feature level, but it heavily relies on the accuracy of feature disentanglement, and GANs are also unstable during training. In this work, we delve into the intrinsic connections between the two from a more general training paradigm perspective, introducing a novel diffusion-based unified method UniFace++. Specifically, this work combines the advantages of each, <i>i.e.</i>, stability of reconstruction training from reenactment, simplicity and effectiveness of the target-oriented processing from swapping, and redefining both as target-oriented reconstruction tasks. In this way, face reenactment avoids complex source feature deformation and face swapping mitigates the unstable seesaw-style optimization. The core of our approach is the rendered face obtained from reassembled 3D facial priors serving as the target pivot, which contains precise geometry and coarse identity textures. We further incorporate it with the proposed Texture-Geometry-aware Diffusion Model (TGDM) to perform texture transfer under the reconstruction supervision for high-fidelity face synthesis. Extensive quantitative and qualitative experiments demonstrate the superiority of our method for both tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143599231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0