CLIPFusion: Infrared and visible image fusion network based on image–text large model and adaptive learning

IF 3.7 · Tier 2 (Engineering & Technology) · Q1, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Dongdong Sun, Chuanyun Wang, Tian Wang, Qian Gao, Qiong Liu, Linlin Wang
Citations: 0

Abstract

The goal of infrared and visible image fusion is to integrate complementary multimodal images into fused images that are highly informative and visually effective, with wide applications in automated driving, fault diagnosis and night vision. Since the infrared and visible image fusion task usually has no real labels for reference, the design of the loss function is strongly influenced by human subjectivity, which limits model performance. To address this lack of real labels, this paper designs a prompt generation network based on an image–text large model, which learns text prompts for the different image types by constraining the distances between the unimodal and fused image prompts and their corresponding images in the latent space of the image–text large model. The learned prompt texts are then used as labels for fused image generation by constraining the distance between the fused image and the different prompt texts in the same latent space. To further improve the quality of the fused images, the fused images produced at different iterations are used to adaptively fine-tune the prompt generation network, continuously improving the quality of the generated prompt text labels and indirectly improving the visual quality of the fused images. In addition, to minimise the influence of subjective information in the fused image generation process, a 3D convolution-based fused image generation network is proposed, which integrates infrared and visible features through adaptive learning along an additional dimension. Extensive experiments show that the proposed model achieves good visual quality and quantitative results on infrared–visible image fusion tasks in military, autonomous-driving and low-light scenarios, and generalises well to multi-focus image fusion and medical image fusion tasks.
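As a rough illustration of the prompt-learning mechanism described in the abstract, the sketch below models the learned prompts directly as trainable vectors in CLIP's joint image–text latent space and expresses both distance constraints as cosine distances. This is a minimal reading of the idea, not the authors' implementation: the ViT-B/32 backbone loaded through OpenAI's clip package, the vector-valued prompts and the unweighted sum of loss terms are all assumptions made for brevity.

```python
# Minimal sketch of CLIP-latent-space prompt learning and the prompt-guided
# fusion loss, as described in the abstract. Simplifying assumptions (not from
# the paper): prompts are modelled as free vectors in CLIP's joint embedding
# space rather than token sequences, and cosine distance is used throughout.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()                                   # CLIP stays frozen
for p in clip_model.parameters():
    p.requires_grad_(False)

# One learnable "prompt" per image type: infrared, visible, fused.
embed_dim = clip_model.visual.output_dim            # 512 for ViT-B/32
prompts = torch.nn.ParameterDict(
    {k: torch.nn.Parameter(torch.randn(embed_dim)) for k in ("ir", "vis", "fused")}
).to(device)

def clip_embed(images: torch.Tensor) -> torch.Tensor:
    """CLIP-embed a batch of 3-channel, 224x224, CLIP-normalised images."""
    return F.normalize(clip_model.encode_image(images).float(), dim=-1)

def prompt_loss(ir, vis, fused):
    """Stage 1: pull each learnable prompt towards the CLIP embeddings of the
    corresponding image type (distance constraint in the latent space)."""
    loss = 0.0
    for key, img in (("ir", ir), ("vis", vis), ("fused", fused)):
        p = F.normalize(prompts[key], dim=-1)
        loss = loss + (1.0 - clip_embed(img) @ p).mean()    # cosine distance
    return loss

def fusion_loss(fused):
    """Stage 2: the learned prompts act as labels; the fused image is pushed
    towards all three prompts in the latent space (prompts are frozen here,
    gradients flow back into the network that produced `fused`)."""
    f = clip_embed(fused)
    loss = 0.0
    for key in ("ir", "vis", "fused"):
        p = F.normalize(prompts[key].detach(), dim=-1)
        loss = loss + (1.0 - f @ p).mean()
    return loss
```

In a full training loop the two stages would alternate: fused images produced at earlier iterations are fed back through prompt_loss to fine-tune the prompts, mirroring the adaptive fine-tuning loop the abstract describes.

The 3D convolution-based generation network can be sketched in the same spirit by stacking the two modalities along an extra depth axis and letting Conv3d layers learn how to mix them; the layer widths and kernel sizes below are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Conv3DFusionSketch(nn.Module):
    """Hypothetical 3D-convolution fusion backbone: infrared and visible inputs
    are stacked along a new modality axis so the mixing is learned, not hand-designed."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(1, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            # collapse the 2-deep modality axis back to a single image
            nn.Conv3d(ch, 1, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
        )

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # ir, vis: [B, 1, H, W] single-channel images
        x = torch.stack([ir, vis], dim=2)           # [B, 1, 2, H, W]
        fused = self.body(x)                        # [B, 1, 1, H, W]
        return torch.sigmoid(fused.squeeze(2))      # [B, 1, H, W] in [0, 1]
```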
Source journal
Displays (Engineering: Electrical & Electronic)
CiteScore: 4.60
Self-citation rate: 25.60%
Articles per year: 138
Review time: 92 days
Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including the display-human interface. Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.