Yichen Liu, Junjie Ye, Wangpeng He, Zhiqiang Qu, Ruoxuan Xu
{"title":"NTFNet: RGB-TIR语义分割的先窄后融合网络","authors":"Yichen Liu, Junjie Ye, Wangpeng He, Zhiqiang Qu, Ruoxuan Xu","doi":"10.1007/s10489-025-06411-7","DOIUrl":null,"url":null,"abstract":"<div><p>In recent years, the task of understanding scenes in visible (RGB) and thermal-infrared (TIR) images has garnered increasing interest in the field of computer vision. However, most existing methods employ simplistic fusion strategies to merge features from different modalities. These strategies often overlook the differences in shallow-level features between modalities, thereby reducing the discriminability of the fused features and resulting in suboptimal segmentation performance. To address this issue, we present a novel RGB-TIR semantic segmentation framework, named NTFNet. This framework exploits the potential consistency of semantic-level features to rectify shallow-level features and reduce discrepancies between modalities prior to integration. Specifically, auxiliary encoders are employed at each layer to capture semantically consistent information. To obtain rich multi-modal semantic features, we designed a High-Level Feature Fusion Module (HFFM) that enhances feature representation in both channel and spatial dimensions. Subsequently, the Shallow Feature Difference Rectification Module (SFDRM) is introduced to rectify the difference in shallow-level features. To address the loss of detailed information during the rectification process, the SFDRM incorporates a Detail Attention Mechanism (DAM) to preserve the original detail information, thereby further optimizing the final segmentation results. In the end, a Multi-Scale Feature Fusion module (Multi-Scale FFM) is designed to combine the rectified features. Comprehensive experiments on two public RGB-TIR datasets show that our method significantly outperforms other state-of-the-art approaches in terms of performance.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 6","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10489-025-06411-7.pdf","citationCount":"0","resultStr":"{\"title\":\"NTFNet: Narrowing-Then-Fusing network for RGB-TIR semantic segmentation\",\"authors\":\"Yichen Liu, Junjie Ye, Wangpeng He, Zhiqiang Qu, Ruoxuan Xu\",\"doi\":\"10.1007/s10489-025-06411-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In recent years, the task of understanding scenes in visible (RGB) and thermal-infrared (TIR) images has garnered increasing interest in the field of computer vision. However, most existing methods employ simplistic fusion strategies to merge features from different modalities. These strategies often overlook the differences in shallow-level features between modalities, thereby reducing the discriminability of the fused features and resulting in suboptimal segmentation performance. To address this issue, we present a novel RGB-TIR semantic segmentation framework, named NTFNet. This framework exploits the potential consistency of semantic-level features to rectify shallow-level features and reduce discrepancies between modalities prior to integration. Specifically, auxiliary encoders are employed at each layer to capture semantically consistent information. To obtain rich multi-modal semantic features, we designed a High-Level Feature Fusion Module (HFFM) that enhances feature representation in both channel and spatial dimensions. 
Subsequently, the Shallow Feature Difference Rectification Module (SFDRM) is introduced to rectify the difference in shallow-level features. To address the loss of detailed information during the rectification process, the SFDRM incorporates a Detail Attention Mechanism (DAM) to preserve the original detail information, thereby further optimizing the final segmentation results. In the end, a Multi-Scale Feature Fusion module (Multi-Scale FFM) is designed to combine the rectified features. Comprehensive experiments on two public RGB-TIR datasets show that our method significantly outperforms other state-of-the-art approaches in terms of performance.</p></div>\",\"PeriodicalId\":8041,\"journal\":{\"name\":\"Applied Intelligence\",\"volume\":\"55 6\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-03-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s10489-025-06411-7.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10489-025-06411-7\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06411-7","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
NTFNet: Narrowing-Then-Fusing network for RGB-TIR semantic segmentation
In recent years, the task of understanding scenes in visible (RGB) and thermal-infrared (TIR) images has garnered increasing interest in the field of computer vision. However, most existing methods employ simplistic fusion strategies to merge features from different modalities. These strategies often overlook the differences in shallow-level features between modalities, reducing the discriminability of the fused features and yielding suboptimal segmentation performance. To address this issue, we present a novel RGB-TIR semantic segmentation framework, named NTFNet, which exploits the potential consistency of semantic-level features to rectify shallow-level features and narrow the discrepancies between modalities prior to fusion. Specifically, auxiliary encoders are employed at each layer to capture semantically consistent information. To obtain rich multi-modal semantic features, we design a High-Level Feature Fusion Module (HFFM) that enhances feature representation in both the channel and spatial dimensions. Subsequently, a Shallow Feature Difference Rectification Module (SFDRM) is introduced to rectify the differences in shallow-level features. To counter the loss of detail during rectification, the SFDRM incorporates a Detail Attention Mechanism (DAM) that preserves the original detail information, further improving the final segmentation results. Finally, a Multi-Scale Feature Fusion Module (Multi-Scale FFM) is designed to combine the rectified features. Comprehensive experiments on two public RGB-TIR datasets show that our method significantly outperforms other state-of-the-art approaches.
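The abstract describes the narrowing-then-fusing idea only at a high level, so the following is a minimal PyTorch sketch of how such a pipeline could be wired up, not the authors' implementation: all module internals (the channel/spatial gating in the HFFM stand-in, the guidance-based rectification and detail gate in the SFDRM stand-in) are illustrative assumptions inferred from the module names in the abstract.

```python
# Hedged sketch of a narrowing-then-fusing pipeline. HFFM and SFDRM internals
# are assumptions; only the module roles come from the abstract.
import torch
import torch.nn as nn


class HFFM(nn.Module):
    """Hypothetical High-Level Feature Fusion Module: fuses RGB and TIR
    semantic features via channel attention followed by spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, tir], dim=1)          # (B, 2C, H, W)
        fused = self.proj(x)                      # (B, C, H, W)
        fused = fused * self.channel_gate(x)      # channel re-weighting
        # spatial attention from channel-wise mean and max maps
        smap = torch.cat([fused.mean(1, keepdim=True),
                          fused.amax(1, keepdim=True)], dim=1)
        return fused * self.spatial_gate(smap)    # spatial re-weighting


class SFDRM(nn.Module):
    """Hypothetical Shallow Feature Difference Rectification Module: uses the
    fused semantic feature to narrow the gap in a shallow feature before
    fusion, with a detail gate standing in for the DAM."""

    def __init__(self, channels: int):
        super().__init__()
        self.rectify = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.detail_gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        guide = nn.functional.interpolate(
            semantic, size=shallow.shape[-2:], mode="bilinear", align_corners=False
        )
        rectified = self.rectify(shallow + guide)      # narrow the modality gap
        detail = shallow * self.detail_gate(shallow)   # preserve fine detail
        return rectified + detail


if __name__ == "__main__":
    rgb_sem = torch.randn(1, 64, 16, 16)   # high-level RGB feature
    tir_sem = torch.randn(1, 64, 16, 16)   # high-level TIR feature
    shallow = torch.randn(1, 64, 64, 64)   # shallow feature to rectify

    semantic = HFFM(64)(rgb_sem, tir_sem)
    out = SFDRM(64)(shallow, semantic)
    print(out.shape)  # torch.Size([1, 64, 64, 64])
```

Under these assumptions, the key design point is ordering: the semantic-level fusion (HFFM) runs first, and its output then guides the rectification of shallow features (SFDRM) before any shallow-level merging, which is what "narrowing-then-fusing" suggests.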
Journal introduction:
With a focus on research in artificial intelligence and neural networks, this journal addresses the solution of real-life manufacturing, defense, management, government, and industrial problems that are too complex to be solved through conventional approaches and that require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments addressing complex, real-world problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.