RefineFuse: an end-to-end network for multi-scale refinement fusion of multi-modality images.

Chengcheng Song, Hui Li, Tianyang Xu, Xiao-Jun Wu, Josef Kittler

Visual Intelligence, vol. 3, no. 1, p. 16 (2025). DOI: 10.1007/s44267-025-00087-w. Published online 24 September 2025. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12460437/pdf/
The goal of multi-modality image fusion is to integrate complementary information from images of different modalities to create high-quality, informative fused images. In recent years, significant advances have been made in deep learning for image fusion tasks. Nevertheless, current fusion techniques still struggle to capture the more intricate details of the source images. For instance, many existing methods used for tasks such as infrared and visible image fusion are susceptible to adverse lighting conditions. To enhance the ability of fusion networks to preserve detailed information in complex scenes, we propose RefineFuse, a multi-scale interaction network for multi-modal image fusion tasks. To balance and exploit local detailed features and global semantic information during the fusion process, we utilize specific modules to model cross-modal feature coupling in both the pixel and semantic domains. Specifically, a dual attention-based feature interaction module is introduced to integrate detailed information from both modalities when extracting shallow features. To obtain deep semantic information, we adopt a global attention mechanism for cross-modal feature interaction. Additionally, to bridge the gap between deep semantic information and shallow detailed information, we gradually incorporate the deep semantic information into the shallow detailed information via dedicated feature interaction modules. Extensive comparative and generalization experiments demonstrate that RefineFuse achieves high-quality fusion of infrared, visible, and medical images, while also facilitating advanced visual tasks such as object detection.
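To make the pipeline described above concrete, the following PyTorch sketch shows one plausible realization of the three components named in the abstract: a dual (channel plus spatial) attention block for pixel-domain interaction of shallow features, a global cross-attention block for semantic-domain interaction of deep features, and a refinement step that injects upsampled deep semantics into the shallow detail branch before decoding the fused image. All module names, channel sizes, and attention formulations here are assumptions made for illustration; this is not the authors' reference implementation.

```python
# Minimal, illustrative sketch of a RefineFuse-style fusion network (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttentionInteraction(nn.Module):
    """Pixel-domain interaction for shallow features: channel + spatial attention
    over concatenated infrared/visible features (assumed formulation)."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(2 * channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, 2 * channels), nn.Sigmoid())
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_ir, f_vis):
        x = torch.cat([f_ir, f_vis], dim=1)                       # (B, 2C, H, W)
        w = self.channel_fc(x.mean(dim=(2, 3)))[..., None, None]  # channel attention
        x = x * w
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(s))                # spatial attention
        return self.fuse(x)                                        # (B, C, H, W)


class GlobalCrossAttention(nn.Module):
    """Semantic-domain interaction for deep features via multi-head cross-attention
    over flattened spatial tokens (assumed formulation)."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, f_ir, f_vis):
        b, c, h, w = f_ir.shape
        q = f_ir.flatten(2).transpose(1, 2)                        # (B, HW, C)
        kv = f_vis.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        out = self.norm(out + q)
        return out.transpose(1, 2).reshape(b, c, h, w)


class RefineFuseSketch(nn.Module):
    """End-to-end skeleton: shallow pixel-level interaction, deep semantic
    interaction, and injection of deep semantics into the shallow detail branch."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.shallow_ir = nn.Conv2d(1, channels, 3, padding=1)
        self.shallow_vis = nn.Conv2d(1, channels, 3, padding=1)
        self.deep_ir = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.deep_vis = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.pixel_interact = DualAttentionInteraction(channels)
        self.semantic_interact = GlobalCrossAttention(channels)
        self.refine = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Tanh())

    def forward(self, ir, vis):
        s_ir, s_vis = self.shallow_ir(ir), self.shallow_vis(vis)
        d_ir, d_vis = self.deep_ir(s_ir), self.deep_vis(s_vis)
        shallow = self.pixel_interact(s_ir, s_vis)                 # detail branch
        deep = self.semantic_interact(d_ir, d_vis)                 # semantic branch
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = self.refine(torch.cat([shallow, deep_up], dim=1))  # bridge the gap
        return self.decoder(fused)


if __name__ == "__main__":
    ir = torch.randn(1, 1, 128, 128)   # single-channel infrared input
    vis = torch.randn(1, 1, 128, 128)  # single-channel (e.g., luminance) visible input
    print(RefineFuseSketch()(ir, vis).shape)  # torch.Size([1, 1, 128, 128])
```

The split into a pixel-domain branch and a semantic-domain branch mirrors the abstract's distinction between local detailed features and global semantic information; the single refinement convolution stands in for the paper's progressive, multi-scale incorporation of deep semantics, which would repeat this step at several resolutions.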