MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society
Pub Date: 2025-06-06
DOI: 10.1109/TIP.2025.3574937

Yepeng Liu; Zhichao Sun; Baosheng Yu; Yitian Zhao; Bo Du; Yongchao Xu; Jun Cheng

Volume 34, pages 3593-3608
Publication type: Journal Article
Source: https://ieeexplore.ieee.org/document/11024126/

Citations: 0

Abstract

MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching

Many keypoint detection and description methods have been proposed for image matching or registration. While these methods perform well on single-modality image matching, they often struggle with multimodal data, because descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) that computes modality-invariant features for keypoint description in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods on three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet learns modality-invariant features for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The code will be released at https://github.com/lyp-deeplearning/MIFNet.
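As a rough illustration of what matching keypoint descriptors between two images involves, the sketch below pairs descriptors by mutual nearest neighbor on cosine similarity. This is a generic baseline chosen for illustration, not MIFNet's pipeline, which the abstract does not specify at this level. The function name `mutual_nearest_neighbors` and the synthetic descriptors are hypothetical.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Return index pairs (i, j) where desc_a[i] and desc_b[j] select each other."""
    # L2-normalize rows so the dot product equals cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (Na, Nb) similarity matrix
    best_b_for_a = sim.argmax(axis=1)  # best candidate in B for each descriptor in A
    best_a_for_b = sim.argmax(axis=0)  # best candidate in A for each descriptor in B
    idx_a = np.arange(sim.shape[0])
    # Keep a pair only when both directions agree.
    keep = best_a_for_b[best_b_for_a] == idx_a
    return np.stack([idx_a[keep], best_b_for_a[keep]], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    desc_a = rng.normal(size=(5, 32))
    # Hypothetical second image: the same descriptors, shuffled and slightly perturbed.
    perm = np.array([2, 0, 4, 1, 3])
    desc_b = desc_a[perm] + 0.01 * rng.normal(size=(5, 32))
    print(mutual_nearest_neighbors(desc_a, desc_b))
```

The mutual check discards one-sided matches, which matters most when descriptors from the two images differ systematically, as they do across modalities. Modality-invariant descriptors aim to make the similarity matrix itself reliable, so that a simple matcher like this one has less to filter out.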


Source journal

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society

Self-citation rate

0.00%

Number of publications