{"title":"Matryoshka Learning With Metric Transfer for Image-Text Matching","authors":"Pengzhe Wang;Lei Zhang;Zhendong Mao;Nenan Lyu;Yongdong Zhang","doi":"10.1109/TCSVT.2025.3558996","DOIUrl":null,"url":null,"abstract":"Image-text matching is a significant technology for vision-language tasks, as it bridges the semantic gap between visual and text modalities. Although existing methods have achieved remarkable progress, high-dimensional embeddings or ensemble methods are often used to achieve sufficiently good recall or accuracy, which significantly increase the computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment, however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) the distillation from teacher model to student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results. 2) distilling knowledge from each modality independently cannot ensure the preservation of cross-modal alignment established in the original embeddings, which can lead to the compressed ones failing to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with shared backbone to compress the different granularity information into a low-dimensional embedding, facilitating the integration of cross-modal matching and knowledge distillation in one single stage. Meanwhile, a novel metric transfer criterion is innovated to diversely align the metric relations across embedding spaces of different dimensions and modalities, ensuring a good cross-modal alignment after distillation. In this way, our MAMET transfers strong representation and generalization capability from the high-dimensional ensemble models to a basic network, which not only can get great performance boost, but also introduce no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, consistently achieving an average of 2%-20% performance improvement over state-of-the-art methods across various backbones and domains.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9502-9516"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10955419/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Image-text matching is a key technology for vision-language tasks, as it bridges the semantic gap between the visual and textual modalities. Although existing methods have achieved remarkable progress, they often rely on high-dimensional embeddings or ensemble methods to reach sufficiently good recall or accuracy, which significantly increases computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment; however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) distillation from the teacher model to the student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results; 2) distilling knowledge from each modality independently cannot guarantee that the cross-modal alignment established in the original embeddings is preserved, which can leave the compressed embeddings unable to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with a shared backbone that compresses the different granularities of information into a low-dimensional embedding, integrating cross-modal matching and knowledge distillation into a single stage. Meanwhile, a novel metric transfer criterion is introduced to align the metric relations across embedding spaces of different dimensions and modalities, ensuring good cross-modal alignment after distillation. In this way, our MAMET transfers the strong representation and generalization capability of the high-dimensional ensemble models to a basic network, which not only yields a substantial performance boost but also introduces no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the superior effectiveness and efficiency of our MAMET, which consistently achieves 2%-20% average performance improvements over state-of-the-art methods across various backbones and domains.
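To make the abstract's single-stage idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of Matryoshka-style training combined with a metric-transfer term, assuming PyTorch, paired image/text encoders that output D-dimensional embeddings, and stand-in choices such as an InfoNCE matching loss, MSE-based similarity alignment, the nested dimensions (64, 128, 256), and the weight `lam`; all function names and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of single-stage Matryoshka training with a
# metric-transfer term; loss choices and names are assumptions.
import torch
import torch.nn.functional as F

def info_nce(img, txt, temperature=0.05):
    """Symmetric contrastive (InfoNCE) image-text matching loss."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def metric_transfer(student_img, student_txt, teacher_img, teacher_txt):
    """Align the cross-modal similarity structure of a truncated (student)
    space with that of the full-dimensional (teacher) space."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).t()
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).t()
    return F.mse_loss(s, t.detach())

def mamet_style_loss(img_emb, txt_emb, nested_dims=(64, 128, 256), lam=1.0):
    """One-stage objective: each nested prefix slice is trained to match on
    its own and to mimic the metric relations of the full embedding."""
    loss = info_nce(img_emb, txt_emb)  # full-dimensional matching
    for d in nested_dims:
        img_d, txt_d = img_emb[:, :d], txt_emb[:, :d]
        loss = loss + info_nce(img_d, txt_d)
        loss = loss + lam * metric_transfer(img_d, txt_d, img_emb, txt_emb)
    return loss

# Usage: embeddings from a shared backbone, e.g. shape (batch, 1024)
# img_emb, txt_emb = image_encoder(images), text_encoder(captions)
# loss = mamet_style_loss(img_emb, txt_emb)
```

Because every truncated slice is supervised alongside the full embedding in the same forward pass, matching and distillation share one learning objective, and only the low-dimensional slice needs to be kept at inference time.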
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.