Learning Modality-Invariant Feature for Multimodal Image Matching via Knowledge Distillation

IF 8.6 · Region 1 (Earth Science) · Q1 Engineering, Electrical & Electronic
Yepeng Liu;Wenpeng Lai;Yuliang Gu;Gui-Song Xia;Bo Du;Yongchao Xu
{"title":"Learning Modality-Invariant Feature for Multimodal Image Matching via Knowledge Distillation","authors":"Yepeng Liu;Wenpeng Lai;Yuliang Gu;Gui-Song Xia;Bo Du;Yongchao Xu","doi":"10.1109/TGRS.2025.3568003","DOIUrl":null,"url":null,"abstract":"Multimodal remote sensing image matching is essential for multisource information fusion. Recently, learning-based feature matching networks have significantly enhanced the performance of unimodal image matching tasks through data-driven approaches. However, progress in applying these learning-based methods to multimodal image matching has been slower. A major obstacle is the substantial nonlinear radiometric differences between modalities, which require networks to learn modality-invariant features from large amounts of paired data. To address this, we propose EMINet, an efficient method for learning modality-invariant features from limited data to improve matching performance. Our approach constructs a high-performance teacher network by combining the DINOv2 foundational model, the keypoint and descriptor extraction network SuperPoint, and the feature matching network SuperGlue. Leveraging the strong semantic representation capability of DINOv2, the teacher network achieves excellent cross-modality matching ability. To meet low-latency requirements in practical applications, we introduce two novel knowledge distillation strategies: semantic window relation distillation (SWRD) and cross-triplet descriptor distillation (CTDD). SWRD improves the discriminative power of the student network’s descriptors by learning patch-level distributions from DINOv2, while CTDD enforces cross-modality triplet constraints to enhance modality invariance of the student network. Experimental results demonstrate that EMINet outperforms several state-of-the-art methods on various datasets, including optical-SAR, optical-NIR, and optical-IR datasets.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-15"},"PeriodicalIF":8.6000,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10994265/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Multimodal remote sensing image matching is essential for multisource information fusion. Recently, learning-based feature matching networks have significantly enhanced the performance of unimodal image matching tasks through data-driven approaches. However, progress in applying these learning-based methods to multimodal image matching has been slower. A major obstacle is the substantial nonlinear radiometric differences between modalities, which require networks to learn modality-invariant features from large amounts of paired data. To address this, we propose EMINet, an efficient method for learning modality-invariant features from limited data to improve matching performance. Our approach constructs a high-performance teacher network by combining the DINOv2 foundational model, the keypoint and descriptor extraction network SuperPoint, and the feature matching network SuperGlue. Leveraging the strong semantic representation capability of DINOv2, the teacher network achieves excellent cross-modality matching ability. To meet low-latency requirements in practical applications, we introduce two novel knowledge distillation strategies: semantic window relation distillation (SWRD) and cross-triplet descriptor distillation (CTDD). SWRD improves the discriminative power of the student network’s descriptors by learning patch-level distributions from DINOv2, while CTDD enforces cross-modality triplet constraints to enhance modality invariance of the student network. Experimental results demonstrate that EMINet outperforms several state-of-the-art methods on various datasets, including optical-SAR, optical-NIR, and optical-IR datasets.
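The abstract only names the two distillation objectives; as a rough illustration of how such losses are commonly formulated, here is a minimal PyTorch sketch. All names (`swrd_loss`, `ctdd_loss`), tensor shapes, and hyperparameters (temperature, margin) are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def swrd_loss(teacher_patches: torch.Tensor,
              student_desc: torch.Tensor,
              temperature: float = 0.1) -> torch.Tensor:
    """Sketch of a semantic window relation distillation term.

    Assumed shapes: teacher_patches (B, N, C_t) holds DINOv2 patch
    features for the N patches of a window; student_desc (B, N, C_s)
    holds the student's descriptors sampled at the same locations.
    The student's patch-to-patch similarity distribution is pushed
    toward the teacher's via KL divergence.
    """
    t = F.normalize(teacher_patches, dim=-1)
    s = F.normalize(student_desc, dim=-1)
    # Pairwise similarity distributions over the N patches in a window.
    p_teacher = F.softmax(t @ t.transpose(-1, -2) / temperature, dim=-1)
    log_p_student = F.log_softmax(s @ s.transpose(-1, -2) / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


def ctdd_loss(anchor: torch.Tensor,
              positive: torch.Tensor,
              negative: torch.Tensor,
              margin: float = 0.2) -> torch.Tensor:
    """Sketch of a cross-modality triplet constraint on descriptors.

    Anchors come from one modality (e.g., optical keypoints), positives
    are the matching keypoints in the other modality (e.g., SAR), and
    negatives are non-matching descriptors, so the margin is enforced
    across modalities rather than within one.
    """
    return F.triplet_margin_loss(
        F.normalize(anchor, dim=-1),
        F.normalize(positive, dim=-1),
        F.normalize(negative, dim=-1),
        margin=margin,
    )
```

In a setup like this, the two distillation terms would typically be weighted and added to the student network's base detection and matching loss during training.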
Source Journal
IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology: Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Articles per year: 1912
Review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.