Dongdong Sun , Chuanyun Wang , Tian Wang , Qian Gao , Qiong Liu , Linlin Wang
{"title":"CLIPFusion:基于图像-文本大模型和自适应学习的红外和可见光图像融合网络","authors":"Dongdong Sun , Chuanyun Wang , Tian Wang , Qian Gao , Qiong Liu , Linlin Wang","doi":"10.1016/j.displa.2025.103042","DOIUrl":null,"url":null,"abstract":"<div><div>The goal of infrared and visible image fusion is to integrate complementary multimodal images to produce highly informative and visually effective fused images, which have a wide range of applications in automated driving, fault diagnosis and night vision. Since the infrared and visible image fusion task usually does not have real labels as a reference, the design of the loss function is highly influenced by human subjectivity, which limits the performance of the model. To address the issue of insufficient real labels, this paper designs a prompt generation network based on the image–text large model, which learns text prompts for different types of images by restricting the distances between unimodal image prompts and fused image prompts to the corresponding images in the potential space of the image–text large model. The learned prompt texts are then used as labels for fused image generation by constraining the distance between the fused image and the different prompt texts in the latent space of the large image–text model. To further improve the quality of the fused images, this paper uses the fused images generated with different iterations to adaptively fine-tune the prompt generation network to continuously improve the quality of the generated prompt text labels and indirectly improve the visual effect of the fused images. In addition, to minimise the influence of subjective information in the fused image generation process, a 3D convolution-based fused image generation network is proposed to achieve the integration of infrared and visible feature through adaptive learning in additional dimensions. Extensive experiments show that the proposed model exhibits good visual effects and quantitative metrics in infrared–visible image fusion tasks in military scenarios, autopilot scenarios and dark-light scenarios, as well as good generalisation ability in multi-focus image fusion and medical image fusion tasks.</div></div>","PeriodicalId":50570,"journal":{"name":"Displays","volume":"89 ","pages":"Article 103042"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIPFusion: Infrared and visible image fusion network based on image–text large model and adaptive learning\",\"authors\":\"Dongdong Sun , Chuanyun Wang , Tian Wang , Qian Gao , Qiong Liu , Linlin Wang\",\"doi\":\"10.1016/j.displa.2025.103042\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The goal of infrared and visible image fusion is to integrate complementary multimodal images to produce highly informative and visually effective fused images, which have a wide range of applications in automated driving, fault diagnosis and night vision. Since the infrared and visible image fusion task usually does not have real labels as a reference, the design of the loss function is highly influenced by human subjectivity, which limits the performance of the model. 
To address the issue of insufficient real labels, this paper designs a prompt generation network based on the image–text large model, which learns text prompts for different types of images by restricting the distances between unimodal image prompts and fused image prompts to the corresponding images in the potential space of the image–text large model. The learned prompt texts are then used as labels for fused image generation by constraining the distance between the fused image and the different prompt texts in the latent space of the large image–text model. To further improve the quality of the fused images, this paper uses the fused images generated with different iterations to adaptively fine-tune the prompt generation network to continuously improve the quality of the generated prompt text labels and indirectly improve the visual effect of the fused images. In addition, to minimise the influence of subjective information in the fused image generation process, a 3D convolution-based fused image generation network is proposed to achieve the integration of infrared and visible feature through adaptive learning in additional dimensions. Extensive experiments show that the proposed model exhibits good visual effects and quantitative metrics in infrared–visible image fusion tasks in military scenarios, autopilot scenarios and dark-light scenarios, as well as good generalisation ability in multi-focus image fusion and medical image fusion tasks.</div></div>\",\"PeriodicalId\":50570,\"journal\":{\"name\":\"Displays\",\"volume\":\"89 \",\"pages\":\"Article 103042\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Displays\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0141938225000794\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Displays","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0141938225000794","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
CLIPFusion: Infrared and visible image fusion network based on image–text large model and adaptive learning
The goal of infrared and visible image fusion is to integrate complementary multimodal images to produce highly informative and visually effective fused images, which have a wide range of applications in automated driving, fault diagnosis and night vision. Because the infrared and visible image fusion task usually lacks real labels for reference, the design of the loss function is strongly influenced by human subjectivity, which limits model performance. To address the issue of insufficient real labels, this paper designs a prompt generation network based on a large image–text model, which learns text prompts for different types of images by constraining the distances between the unimodal and fused image prompts and their corresponding images in the latent space of the large image–text model. The learned prompt texts are then used as labels for fused image generation by constraining the distance between the fused image and the different prompt texts in the latent space of the large image–text model. To further improve the quality of the fused images, the fused images generated at different iterations are used to adaptively fine-tune the prompt generation network, continuously improving the quality of the generated prompt text labels and indirectly improving the visual quality of the fused images. In addition, to minimise the influence of subjective information in the fused image generation process, a 3D convolution-based fused image generation network is proposed to integrate infrared and visible features through adaptive learning in additional dimensions. Extensive experiments show that the proposed model achieves good visual quality and quantitative metrics on infrared–visible image fusion tasks in military, autonomous driving and low-light scenarios, as well as good generalisation ability on multi-focus image fusion and medical image fusion tasks.
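To make the latent-space constraint described in the abstract concrete, the sketch below shows, under stated assumptions, one way a CLIP-style distance loss could be expressed in PyTorch: the fused image's embedding is pulled toward learned prompt embeddings via cosine distance. The function name `clip_space_loss`, the weighting parameters and the exact combination of terms are illustrative assumptions, not the authors' implementation; the paper's actual loss and prompt-learning scheme may differ.

```python
import torch
import torch.nn.functional as F

def clip_space_loss(fused_emb: torch.Tensor,
                    ir_prompt_emb: torch.Tensor,
                    vis_prompt_emb: torch.Tensor,
                    fused_prompt_emb: torch.Tensor,
                    w_fused: float = 1.0,
                    w_unimodal: float = 0.5) -> torch.Tensor:
    """Cosine-distance loss in an image-text latent space (hypothetical sketch).

    All inputs are (batch, dim) embeddings assumed to come from a frozen
    image-text encoder, e.g. the image tower for `fused_emb` and the text
    tower for the prompt embeddings.
    """
    # Normalise so that cosine similarity reduces to a dot product.
    fused_emb = F.normalize(fused_emb, dim=-1)
    ir_prompt_emb = F.normalize(ir_prompt_emb, dim=-1)
    vis_prompt_emb = F.normalize(vis_prompt_emb, dim=-1)
    fused_prompt_emb = F.normalize(fused_prompt_emb, dim=-1)

    # 1 - cosine similarity: 0 when embeddings align, 2 when opposite.
    d_fused = 1.0 - (fused_emb * fused_prompt_emb).sum(dim=-1)
    d_ir = 1.0 - (fused_emb * ir_prompt_emb).sum(dim=-1)
    d_vis = 1.0 - (fused_emb * vis_prompt_emb).sum(dim=-1)

    # Pull the fused image toward its own prompt while keeping it close to
    # the unimodal prompts, so complementary content is preserved.
    return (w_fused * d_fused + w_unimodal * (d_ir + d_vis)).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    emb = lambda: torch.randn(4, 512)  # stand-ins for precomputed CLIP features
    loss = clip_space_loss(emb(), emb(), emb(), emb())
    print(f"loss: {loss.item():.4f}")
```

In practice the prompt embeddings would be produced by the learned prompt generation network and the image embedding by the frozen image encoder; the weights here simply trade off fidelity to the fused-image prompt against closeness to the unimodal prompts.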
Journal introduction:
Displays is the international journal covering the research and development of display technology, the effective presentation and perception of information, and applications and systems, including the display-human interface.
Technical papers on practical developments in display technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance the effective presentation of information. Tutorial papers covering fundamentals, intended for display technology and human factors engineers new to the field, will also occasionally be featured.