{"title":"Matryoshka Learning With Metric Transfer for Image-Text Matching","authors":"Pengzhe Wang;Lei Zhang;Zhendong Mao;Nenan Lyu;Yongdong Zhang","doi":"10.1109/TCSVT.2025.3558996","DOIUrl":null,"url":null,"abstract":"Image-text matching is a significant technology for vision-language tasks, as it bridges the semantic gap between visual and text modalities. Although existing methods have achieved remarkable progress, high-dimensional embeddings or ensemble methods are often used to achieve sufficiently good recall or accuracy, which significantly increase the computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment, however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) the distillation from teacher model to student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results. 2) distilling knowledge from each modality independently cannot ensure the preservation of cross-modal alignment established in the original embeddings, which can lead to the compressed ones failing to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with shared backbone to compress the different granularity information into a low-dimensional embedding, facilitating the integration of cross-modal matching and knowledge distillation in one single stage. Meanwhile, a novel metric transfer criterion is innovated to diversely align the metric relations across embedding spaces of different dimensions and modalities, ensuring a good cross-modal alignment after distillation. In this way, our MAMET transfers strong representation and generalization capability from the high-dimensional ensemble models to a basic network, which not only can get great performance boost, but also introduce no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, consistently achieving an average of 2%-20% performance improvement over state-of-the-art methods across various backbones and domains.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9502-9516"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10955419/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Image-text matching is a key technology for vision-language tasks, as it bridges the semantic gap between the visual and textual modalities. Although existing methods have achieved remarkable progress, they often rely on high-dimensional embeddings or ensemble methods to reach sufficiently good recall or accuracy, which significantly increases computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment; however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) distillation from the teacher model to the student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results; 2) distilling knowledge from each modality independently cannot guarantee that the cross-modal alignment established in the original embeddings is preserved, which can leave the compressed embeddings unable to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with a shared backbone that compresses the different granularities of information into a low-dimensional embedding, integrating cross-modal matching and knowledge distillation into a single stage. Meanwhile, a novel metric transfer criterion is introduced to align the metric relations across embedding spaces of different dimensions and modalities, ensuring good cross-modal alignment after distillation. In this way, our MAMET transfers the strong representation and generalization capability of the high-dimensional ensemble models to a basic network, which not only yields a substantial performance boost but also introduces no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, which consistently achieves 2%-20% average performance improvements over state-of-the-art methods across various backbones and domains.
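To make the abstract's single-stage idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of Matryoshka-style training combined with a metric-transfer term, assuming PyTorch, paired image/text encoders that output D-dimensional embeddings, and stand-in choices such as an InfoNCE matching loss, MSE-based similarity alignment, the nested dimensions (64, 128, 256), and the weight `lam`; all function names and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of single-stage Matryoshka training with a
# metric-transfer term; loss choices and names are assumptions.
import torch
import torch.nn.functional as F

def info_nce(img, txt, temperature=0.05):
    """Symmetric contrastive (InfoNCE) image-text matching loss."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def metric_transfer(student_img, student_txt, teacher_img, teacher_txt):
    """Align the cross-modal similarity structure of a truncated (student)
    space with that of the full-dimensional (teacher) space."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).t()
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).t()
    return F.mse_loss(s, t.detach())

def mamet_style_loss(img_emb, txt_emb, nested_dims=(64, 128, 256), lam=1.0):
    """One-stage objective: each nested prefix slice is trained to match on
    its own and to mimic the metric relations of the full embedding."""
    loss = info_nce(img_emb, txt_emb)  # full-dimensional matching
    for d in nested_dims:
        img_d, txt_d = img_emb[:, :d], txt_emb[:, :d]
        loss = loss + info_nce(img_d, txt_d)
        loss = loss + lam * metric_transfer(img_d, txt_d, img_emb, txt_emb)
    return loss

# Usage: embeddings from a shared backbone, e.g. shape (batch, 1024)
# img_emb, txt_emb = image_encoder(images), text_encoder(captions)
# loss = mamet_style_loss(img_emb, txt_emb)
```

Because every truncated slice is supervised alongside the full embedding in the same forward pass, matching and distillation share one learning objective, and only the low-dimensional slice needs to be kept at inference time.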
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.