A Unified Optimal Transport Framework for Cross-Modal Retrieval With Noisy Labels

IF 8.9 | Zone 1, Computer Science | JCR Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE transactions on neural networks and learning systems Pub Date : 2025-04-30 DOI:10.1109/TNNLS.2025.3559533

Haochen Han;Minnan Luo;Huan Liu;Fang Nan;Jun Liu

Volume 36, Issue 9, pp. 16435-16448. Journal Article. https://ieeexplore.ieee.org/document/10981486/

Citations: 0

Abstract

Cross-modal retrieval (CMR) aims to establish interaction between different modalities. Supervised CMR in particular is gaining ground because of its flexibility in learning semantic category discrimination. Despite the remarkable performance of previous supervised CMR methods, much of their success can be attributed to well-annotated data. However, even for unimodal data, precise annotation is expensive and time-consuming, and it becomes more challenging in the multimodal scenario. In practice, massive multimodal data are collected from the Internet with coarse annotation, which inevitably introduces noisy labels. Training with such misleading labels brings two key challenges: it forces multimodal samples to align with incorrect semantics, and it widens the heterogeneous gap. Both degrade retrieval performance. To tackle these challenges, this work proposes UOT-RCL, a unified framework based on optimal transport (OT) for robust CMR. First, we propose a semantic alignment based on partial OT to progressively correct the noisy labels, where a novel cross-modal consistent cost function is designed to blend different modalities and provide a precise transport cost. Second, to narrow the discrepancy in multimodal data, an OT-based relation alignment is proposed to infer the semantic-level cross-modal matching. Both components leverage the inherent correlation among multimodal data to build an effective cost function. Experiments on three widely used CMR datasets demonstrate that UOT-RCL surpasses state-of-the-art approaches and significantly improves robustness against noisy labels.
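To make the label-correction idea concrete, the following is a minimal sketch of OT-based label reassignment. It is not the UOT-RCL algorithm: the abstract does not give the partial-OT formulation or the cross-modal consistent cost, so this sketch substitutes balanced entropic OT (Sinkhorn iterations) and a simple averaged cost. The function names `sinkhorn` and `correct_labels`, the 0.5 blending weight, the uniform marginals, and the `eps` value are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Balanced entropic OT via Sinkhorn iterations (a stand-in for partial OT)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_mass)
    for _ in range(n_iters):
        v = col_mass / (kernel.T @ u)  # rescale to match column marginals
        u = row_mass / (kernel @ v)    # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]


def correct_labels(p_img, p_txt, eps=0.1):
    """Blend per-modality class probabilities into a cost and reassign labels.

    Assumption: the cost is the negative log of the averaged predictions;
    the paper's cross-modal consistent cost may differ.
    """
    n, k = p_img.shape
    blended = 0.5 * (p_img + p_txt)
    cost = -np.log(blended + 1e-8)
    # Uniform mass over samples and classes (an assumed class prior).
    plan = sinkhorn(cost, np.full(n, 1.0 / n), np.full(k, 1.0 / k), eps)
    return plan.argmax(axis=1), plan


# Toy predictions for 4 image-text pairs over 2 classes (made-up numbers).
p_img = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
p_txt = np.array([[0.7, 0.3], [0.9, 0.1], [0.3, 0.7], [0.2, 0.8]])
labels, plan = correct_labels(p_img, p_txt)
```

Each row of `plan` spreads one sample's mass across the classes, and the argmax of a row serves as the corrected label. A partial-OT variant would transport only part of the total mass, leaving low-confidence samples unassigned.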


Source journal

IEEE Transactions on Neural Networks and Learning Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore

23.80

Self-citation rate

9.60%

Articles published

2102

Review time

3-8 weeks

About the journal: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.