Learning Modality-Invariant Feature for Multimodal Image Matching via Knowledge Distillation
Authors: Yepeng Liu; Wenpeng Lai; Yuliang Gu; Gui-Song Xia; Bo Du; Yongchao Xu
DOI: 10.1109/TGRS.2025.3568003
Journal: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-15 (JCR Q1, Engineering, Electrical & Electronic; impact factor 8.6)
Publication date: 2025-03-09
URL: https://ieeexplore.ieee.org/document/10994265/
Citations: 0
Abstract
Multimodal remote sensing image matching is essential for multisource information fusion. Recently, learning-based feature matching networks have significantly enhanced the performance of unimodal image matching tasks through data-driven approaches. However, progress in applying these learning-based methods to multimodal image matching has been slower. A major obstacle is the substantial nonlinear radiometric differences between modalities, which require networks to learn modality-invariant features from large amounts of paired data. To address this, we propose EMINet, an efficient method for learning modality-invariant features from limited data to improve matching performance. Our approach constructs a high-performance teacher network by combining the DINOv2 foundation model, the keypoint and descriptor extraction network SuperPoint, and the feature matching network SuperGlue. Leveraging the strong semantic representation capability of DINOv2, the teacher network achieves excellent cross-modality matching ability. To meet low-latency requirements in practical applications, we introduce two novel knowledge distillation strategies: semantic window relation distillation (SWRD) and cross-triplet descriptor distillation (CTDD). SWRD improves the discriminative power of the student network's descriptors by learning patch-level distributions from DINOv2, while CTDD enforces cross-modality triplet constraints to enhance the modality invariance of the student network. Experimental results demonstrate that EMINet outperforms several state-of-the-art methods on various datasets, including optical-SAR, optical-NIR, and optical-IR datasets.
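The abstract names the two distillation losses but does not define them; for illustration only, the sketch below shows what a window-level relation-distillation term (in the spirit of SWRD) and a cross-modality triplet constraint (in the spirit of CTDD) could look like in numpy. The function names, the margin, and the softmax temperature are assumptions on our part, not the paper's actual formulations.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize descriptors to (near-)unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def window_relation_kl(teacher_feats, student_feats, tau=0.1):
    """SWRD-style sketch: match the student's patch-similarity
    distribution within a local window to the teacher's (DINOv2) one
    via KL divergence. Feature dimensions may differ between the two
    networks, since only self-similarities within each are compared.

    teacher_feats: (N, Dt) patch features inside one window
    student_feats: (N, Ds) student descriptors for the same patches
    """
    def sim_dist(f):
        f = l2_normalize(f)
        logits = f @ f.T / tau                      # pairwise cosine similarities
        logits -= logits.max(axis=1, keepdims=True) # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)     # row-wise softmax
    pt, ps = sim_dist(teacher_feats), sim_dist(student_feats)
    return (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=1).mean()

def cross_modality_triplet_loss(anchor, positive, negative, margin=0.2):
    """CTDD-style sketch: hinge triplet loss pulling a descriptor from
    modality A toward its true match in modality B and away from a
    non-matching descriptor. All inputs have shape (N, D)."""
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_pos = np.linalg.norm(a - p, axis=1)  # distance to the true match
    d_neg = np.linalg.norm(a - n, axis=1)  # distance to the impostor
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

In this reading, the relation term transfers the teacher's patch-level structure while the triplet term directly penalizes descriptors that drift apart across modalities; how the paper weights or combines the two terms is not stated in the abstract.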
Journal description:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.