Visible and thermal image fusion network with diffusion models for high-level visual tasks
Jin Meng, Jiahui Zou, Zhuoheng Xiang, Cui Wang, Shifeng Wang, Yan Li, Jonghyuk Kim
Applied Intelligence 55(4), published 2025-01-09. DOI: 10.1007/s10489-024-06210-6
https://link.springer.com/article/10.1007/s10489-024-06210-6
Abstract
Fusion technology enhances the performance of applications such as security, autonomous driving, military surveillance, medical imaging, and environmental monitoring by combining complementary information. The fusion of visible and thermal (RGB-T) images is critical for improving human observation and downstream visual tasks. However, most semantics-driven fusion algorithms train the segmentation and fusion tasks jointly, which increases the computational cost and underutilizes semantic information. Designing a cleaner fusion architecture that mines rich deep semantic features is key to addressing this issue. This paper proposes a two-stage RGB-T image fusion network with diffusion models. In the first stage, a diffusion model is employed to extract multiscale features, providing rich semantic features and texture edges for the fusion network. In the second stage, a semantic feature enhancement module (SFEM) and a detail feature enhancement module (DFEM) are introduced to improve the network's ability to describe fine details, and an adaptive global-local attention mechanism (AGAM) enhances the weights of key features related to visual tasks. We benchmark the proposed algorithm on a new tri-modal sensor driving scene dataset (TSDS), which includes 15,234 sets of labeled images (visible, thermal, and polarization degree images). A semantic segmentation model trained on our fused images achieves 78.41% accuracy, and an object detection model achieves 87.21% mAP. The experimental results indicate that the proposed algorithm outperforms state-of-the-art image fusion algorithms.
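The abstract describes AGAM only at a high level. As a rough illustration of the idea, below is a minimal PyTorch sketch of one way an attention block could adaptively blend global (channel-wise) and local (spatial) re-weighting of fused RGB-T features. The class name, layer sizes, pooling statistics, and the learnable blending scalar are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an adaptive global-local attention block.
# Everything here (names, shapes, gating) is an assumption for
# illustration; it is not the AGAM defined in the paper.
import torch
import torch.nn as nn

class GlobalLocalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Global branch: squeeze-and-excitation style channel weighting.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Local branch: spatial attention over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Learnable scalar that adaptively balances the two branches.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel (global) attention weights, shape (B, C, 1, 1).
        g = self.channel_mlp(self.global_pool(x))
        # Spatial (local) attention map, shape (B, 1, H, W), built from
        # channel-wise mean and max statistics.
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)],
            dim=1,
        )
        l = self.spatial_conv(stats)
        # Adaptive blend of globally and locally re-weighted features.
        return self.alpha * (x * g) + (1 - self.alpha) * (x * l)

# Toy usage: re-weight fused features before a segmentation/detection head.
if __name__ == "__main__":
    feats = torch.randn(2, 64, 120, 160)  # (B, C, H, W) fused features
    out = GlobalLocalAttention(channels=64)(feats)
    print(out.shape)  # torch.Size([2, 64, 120, 160])
```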
About the Journal
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions to real-life manufacturing, defense, management, government, and industrial problems that are too complex to be solved through conventional approaches and that require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.