Xu Cao, Huanxin Zou, Jun Li, Hao Chen, Xinyi Ying, Shitian He, Yingqian Wang, Liyuan Pan
{"title":"多模态图像生成和融合,通过内容风格的混合解纠缠","authors":"Xu Cao , Huanxin Zou , Jun Li , Hao Chen , Xinyi Ying , Shitian He , Yingqian Wang , Liyuan Pan","doi":"10.1016/j.knosys.2025.114597","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, with their performance directly impacting downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a self-supervised and mutual-supervised hybrid mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head along with fusion and transformation modules to ensure synchronous and efficient execution of dual tasks. Our method not only breaks through the single task limitation of the model, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering four modalities fusion tasks on seven popular datasets. Extensive experimental results demonstrate that our method achieves superior performance on two tasks as compared of the respective state-of-the-art methods, and show impressive cross-task generalization capability.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114597"},"PeriodicalIF":7.6000,"publicationDate":"2025-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal image generation and fusion through content-style hybrid disentanglement\",\"authors\":\"Xu Cao , Huanxin Zou , Jun Li , Hao Chen , Xinyi Ying , Shitian He , Yingqian Wang , Liyuan Pan\",\"doi\":\"10.1016/j.knosys.2025.114597\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, with their performance directly impacting downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a self-supervised and mutual-supervised hybrid mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head along with fusion and transformation modules to ensure synchronous and efficient execution of dual tasks. 
Our method not only breaks through the single task limitation of the model, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering four modalities fusion tasks on seven popular datasets. Extensive experimental results demonstrate that our method achieves superior performance on two tasks as compared of the respective state-of-the-art methods, and show impressive cross-task generalization capability.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"330 \",\"pages\":\"Article 114597\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-10-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125016363\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125016363","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Multimodal image generation and fusion through content-style hybrid disentanglement
Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, and their performance directly impacts downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a hybrid self-supervised and mutual-supervised mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head together with fusion and transformation modules to ensure synchronous and efficient execution of the dual tasks. Our method not only breaks through the single-task limitation of existing models, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering fusion tasks across four modalities on seven popular datasets. Extensive experimental results demonstrate that our method outperforms the respective state-of-the-art methods on both tasks and shows impressive cross-task generalization capability.
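The abstract describes the architecture only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the content-style decoupling idea it outlines: a shared encoder that splits each modality into a modality-invariant content code and a modality-specific style code, a fusion head that merges the content codes of two modalities, and a translation head that re-renders a content code under the target modality's style. All module names, layer sizes, and structural details here are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of content-style disentanglement with shared fusion and
# translation heads. Architecture details are assumptions for illustration only.
import torch
import torch.nn as nn


class ContentStyleEncoder(nn.Module):
    """Splits an input image into a spatial content code and a global style vector."""

    def __init__(self, in_ch: int = 1, feat_ch: int = 64, style_dim: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_content = nn.Conv2d(feat_ch, feat_ch, 1)        # modality-invariant content
        self.to_style = nn.Sequential(                           # modality-specific style
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, style_dim)
        )

    def forward(self, x):
        h = self.backbone(x)
        return self.to_content(h), self.to_style(h)


class FusionHead(nn.Module):
    """Fuses the content codes of two modalities into a single image."""

    def __init__(self, feat_ch: int = 64, out_ch: int = 1):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, out_ch, 1),
        )

    def forward(self, content_a, content_b):
        return self.fuse(torch.cat([content_a, content_b], dim=1))


class TranslationHead(nn.Module):
    """Re-renders a content code under the style of the target modality."""

    def __init__(self, feat_ch: int = 64, style_dim: int = 8, out_ch: int = 1):
        super().__init__()
        self.film = nn.Linear(style_dim, 2 * feat_ch)  # style -> per-channel scale and shift
        self.decode = nn.Conv2d(feat_ch, out_ch, 3, padding=1)

    def forward(self, content, style):
        scale, shift = self.film(style).chunk(2, dim=1)
        h = content * scale[..., None, None] + shift[..., None, None]
        return self.decode(torch.relu(h))


if __name__ == "__main__":
    enc = ContentStyleEncoder()
    fusion, translate = FusionHead(), TranslationHead()
    ir, vis = torch.randn(2, 1, 128, 128), torch.randn(2, 1, 128, 128)
    c_ir, s_ir = enc(ir)
    c_vis, s_vis = enc(vis)
    fused = fusion(c_ir, c_vis)          # fusion task: merge both content codes
    ir_to_vis = translate(c_ir, s_vis)   # translation task: IR content, visible-light style
    print(fused.shape, ir_to_vis.shape)  # torch.Size([2, 1, 128, 128]) for both
```

Under this reading, the fusion branch consumes only content codes, which both modalities share, while the translation branch additionally conditions on the target style code; this is one plausible way the two tasks could reuse most of the same encoder parameters, consistent with the joint-optimization framing in the abstract.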
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical studies, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.