A Unified Optimal Transport Framework for Cross-Modal Retrieval With Noisy Labels
Authors: Haochen Han; Minnan Luo; Huan Liu; Fang Nan; Jun Liu
Journal: IEEE Transactions on Neural Networks and Learning Systems, 36(9), pp. 16435-16448 (Journal Article)
Publication date: 2025-04-30
DOI: 10.1109/TNNLS.2025.3559533
URL: https://ieeexplore.ieee.org/document/10981486/
Impact factor: 8.9; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Cross-modal retrieval (CMR) aims to establish interaction between different modalities, and supervised CMR has gained traction because of its flexibility in learning to discriminate semantic categories. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, precise annotation is expensive and time-consuming even for unimodal data, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels raises two key challenges: it forces the multimodal samples to align with incorrect semantics, and it widens the heterogeneous gap, both of which degrade retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a unified framework based on optimal transport (OT) for robust CMR. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide a precise transport cost. Second, to narrow the discrepancy in multimodal data, an OT-based relation alignment is proposed to infer semantic-level cross-modal matching. Both components leverage the inherent correlation among multimodal data to construct effective cost functions. Experiments on three widely used CMR datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
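To give a feel for how an OT plan can reassign labels, the following is a minimal sketch and not the UOT-RCL method itself. It swaps in balanced entropic OT (Sinkhorn iterations) where the abstract describes partial OT, and it uses a plain negative log-probability cost instead of the paper's cross-modal consistent cost. The toy predictions, the labels, and the names `sinkhorn`, `probs`, and `noisy_labels` are all hypothetical.

```python
import numpy as np

def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Balanced entropic OT via Sinkhorn iterations (assumption: not partial OT)."""
    kernel = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)       # rescale rows toward row_mass
        v = col_mass / (kernel.T @ u)     # rescale columns toward col_mass
    return u[:, None] * kernel * v[None, :]

# Hypothetical class probabilities for 4 samples over 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.2, 0.8]])
noisy_labels = np.array([0, 1, 1, 1])     # sample 1 is assumed mislabeled

# Assumed cost: assigning a sample to a class is cheap when the class is likely.
cost = -np.log(probs)
n, k = probs.shape

# Uniform mass over samples and classes, i.e. a class-balance prior.
plan = sinkhorn(cost, np.full(n, 1.0 / n), np.full(k, 1.0 / k))

corrected = plan.argmax(axis=1)           # class receiving most mass per sample
changed = np.flatnonzero(corrected != noisy_labels)
print(corrected, changed)
```

On this toy input the plan sends samples 0 and 1 to class 0 and samples 2 and 3 to class 1, so only sample 1 is relabeled. A partial OT variant would transport only a fraction of the total mass, which lets low-confidence samples remain unassigned instead of being forced into a class.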
About the journal:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.