Xu Cao, Huanxin Zou, Jun Li, Hao Chen, Xinyi Ying, Shitian He, Yingqian Wang, Liyuan Pan
{"title":"多模态图像生成和融合,通过内容风格的混合解纠缠","authors":"Xu Cao , Huanxin Zou , Jun Li , Hao Chen , Xinyi Ying , Shitian He , Yingqian Wang , Liyuan Pan","doi":"10.1016/j.knosys.2025.114597","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, with their performance directly impacting downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a self-supervised and mutual-supervised hybrid mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head along with fusion and transformation modules to ensure synchronous and efficient execution of dual tasks. Our method not only breaks through the single task limitation of the model, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering four modalities fusion tasks on seven popular datasets. Extensive experimental results demonstrate that our method achieves superior performance on two tasks as compared of the respective state-of-the-art methods, and show impressive cross-task generalization capability.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"330 ","pages":"Article 114597"},"PeriodicalIF":7.6000,"publicationDate":"2025-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal image generation and fusion through content-style hybrid disentanglement\",\"authors\":\"Xu Cao , Huanxin Zou , Jun Li , Hao Chen , Xinyi Ying , Shitian He , Yingqian Wang , Liyuan Pan\",\"doi\":\"10.1016/j.knosys.2025.114597\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, with their performance directly impacting downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a self-supervised and mutual-supervised hybrid mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head along with fusion and transformation modules to ensure synchronous and efficient execution of dual tasks. 
Our method not only breaks through the single task limitation of the model, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering four modalities fusion tasks on seven popular datasets. Extensive experimental results demonstrate that our method achieves superior performance on two tasks as compared of the respective state-of-the-art methods, and show impressive cross-task generalization capability.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"330 \",\"pages\":\"Article 114597\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-10-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705125016363\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125016363","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Multimodal image generation and fusion through content-style hybrid disentanglement
Multimodal image fusion and cross-modal translation are fundamental yet challenging tasks in computer vision, and their performance directly impacts downstream applications. Existing approaches typically treat these tasks independently, developing specialized models that fail to exploit the intrinsic relationships between different modalities. This limitation not only restricts model generalizability but also hinders further performance improvements. In this paper, we propose a joint optimization framework for image generation and fusion. Specifically, we generalize multimodal image tasks as the fusion and transformation of cross-modal features, and design a hybrid task training strategy. At the data level, we introduce a hybrid self-supervised and mutual-supervised mechanism for content-style feature decoupling, which achieves superior feature separation through stepwise training on intra-modal and cross-modal data. At the model level, we construct a triple-branch decoupling head together with fusion and transformation modules to ensure synchronous and efficient execution of the dual tasks. Our method not only breaks through the single-task limitation of existing models, but also innovatively introduces mixed supervision into multimodal processing. We conduct comprehensive experiments covering fusion tasks across four modalities on seven popular datasets. Extensive experimental results demonstrate that our method outperforms the respective state-of-the-art methods on both tasks and shows impressive cross-task generalization capability.
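The abstract describes the architecture only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the content-style decoupling idea it outlines: a shared encoder that splits each modality into a modality-invariant content code and a modality-specific style code, a fusion head that merges the content codes of two modalities, and a translation head that re-renders a content code under the target modality's style. All module names, layer sizes, and structural details here are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of content-style disentanglement with shared fusion and
# translation heads. Architecture details are assumptions for illustration only.
import torch
import torch.nn as nn


class ContentStyleEncoder(nn.Module):
    """Splits an input image into a spatial content code and a global style vector."""

    def __init__(self, in_ch: int = 1, feat_ch: int = 64, style_dim: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_content = nn.Conv2d(feat_ch, feat_ch, 1)        # modality-invariant content
        self.to_style = nn.Sequential(                           # modality-specific style
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, style_dim)
        )

    def forward(self, x):
        h = self.backbone(x)
        return self.to_content(h), self.to_style(h)


class FusionHead(nn.Module):
    """Fuses the content codes of two modalities into a single image."""

    def __init__(self, feat_ch: int = 64, out_ch: int = 1):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, out_ch, 1),
        )

    def forward(self, content_a, content_b):
        return self.fuse(torch.cat([content_a, content_b], dim=1))


class TranslationHead(nn.Module):
    """Re-renders a content code under the style of the target modality."""

    def __init__(self, feat_ch: int = 64, style_dim: int = 8, out_ch: int = 1):
        super().__init__()
        self.film = nn.Linear(style_dim, 2 * feat_ch)  # style -> per-channel scale and shift
        self.decode = nn.Conv2d(feat_ch, out_ch, 3, padding=1)

    def forward(self, content, style):
        scale, shift = self.film(style).chunk(2, dim=1)
        h = content * scale[..., None, None] + shift[..., None, None]
        return self.decode(torch.relu(h))


if __name__ == "__main__":
    enc = ContentStyleEncoder()
    fusion, translate = FusionHead(), TranslationHead()
    ir, vis = torch.randn(2, 1, 128, 128), torch.randn(2, 1, 128, 128)
    c_ir, s_ir = enc(ir)
    c_vis, s_vis = enc(vis)
    fused = fusion(c_ir, c_vis)          # fusion task: merge both content codes
    ir_to_vis = translate(c_ir, s_vis)   # translation task: IR content, visible-light style
    print(fused.shape, ir_to_vis.shape)  # torch.Size([2, 1, 128, 128]) for both
```

Under this reading, the fusion branch consumes only content codes, which both modalities share, while the translation branch additionally conditions on the target style code; this is one plausible way the two tasks could reuse most of the same encoder parameters, consistent with the joint-optimization framing in the abstract.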
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical studies, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.