{"title":"Unified learning for image–text alignment via multi-scale feature fusion","authors":"Jing Zhou , Meng Wang","doi":"10.1016/j.cviu.2025.104468","DOIUrl":null,"url":null,"abstract":"<div><div>Cross-modal retrieval, particularly image–text retrieval, aims to achieve efficient matching and retrieval between images and text. With the continuous advancement of deep learning technologies, numerous innovative models and algorithms have emerged. However, existing methods still face some limitations: (1) Most models overly focus on either global or local correspondences, failing to fully integrate global and local information; (2) They typically emphasize cross-modal similarity optimization while neglecting the relationships among samples within the same modality; (3) They struggle to effectively handle noise in image–text pairs, negatively impacting model performance due to noisy negative samples. To address these challenges, this paper proposes a dual-branch structured model that combines global and local matching—Momentum-Augmented Transformer Encoder (MATE). The model aligns closely with human cognitive processes by integrating global and local features and leveraging an External Spatial Attention aggregation (ESA) mechanism and a Multi-modal Fusion Transformer Encoder, significantly enhancing feature representation capabilities. Furthermore, this work introduces a Hard Enhanced Contrastive Triplet Loss (HECT Loss), which effectively optimizes the model’s ability to distinguish positive and negative samples. A self-supervised learning method based on momentum distillation is also employed to further improve image–text matching performance. The experimental results demonstrate that the MATE model outperforms the vast majority of existing state-of-the-art methods on both Flickr30K and MS-COCO datasets. The code is available at <span><span>https://github.com/wangmeng-007/MATE/tree/master</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104468"},"PeriodicalIF":3.5000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001912","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Cross-modal retrieval, particularly image–text retrieval, aims to achieve efficient matching and retrieval between images and text. With the continuous advancement of deep learning, numerous innovative models and algorithms have emerged. However, existing methods still face several limitations: (1) most models focus predominantly on either global or local correspondences and fail to fully integrate global and local information; (2) they typically emphasize cross-modal similarity optimization while neglecting the relationships among samples within the same modality; and (3) they struggle to handle noise in image–text pairs, so noisy negative samples degrade model performance. To address these challenges, this paper proposes the Momentum-Augmented Transformer Encoder (MATE), a dual-branch model that combines global and local matching. The model aligns closely with human cognitive processes by integrating global and local features and by leveraging an External Spatial Attention aggregation (ESA) mechanism and a Multi-modal Fusion Transformer Encoder, which significantly enhance its feature representation capabilities. Furthermore, this work introduces a Hard Enhanced Contrastive Triplet Loss (HECT Loss), which improves the model's ability to distinguish positive from negative samples. A self-supervised learning method based on momentum distillation is also employed to further improve image–text matching performance. Experimental results demonstrate that MATE outperforms the vast majority of existing state-of-the-art methods on both the Flickr30K and MS-COCO datasets. The code is available at https://github.com/wangmeng-007/MATE/tree/master.
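The abstract does not spell out the HECT Loss formulation; as a rough, hypothetical sketch, the idea of a hard-negative contrastive triplet objective over a batch of image and text embeddings can be illustrated with a max-violation hinge loss in the style of VSE++. The class name, margin value, and negative-selection strategy below are assumptions for illustration only and may differ from the loss actually used in MATE.

```python
# Hypothetical sketch of a hard-negative triplet loss over a batch similarity matrix.
# The actual HECT Loss in the paper may weight or select negatives differently.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardNegativeTripletLoss(nn.Module):
    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity matrix between all images and texts in the batch.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        scores = img_emb @ txt_emb.t()              # (B, B)
        pos = scores.diag().view(-1, 1)             # matched pairs lie on the diagonal

        # Mask out the positives so they cannot be selected as negatives.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        neg_scores = scores.masked_fill(mask, -float("inf"))

        # Hardest negative text for each image, hardest negative image for each text.
        hard_neg_txt = neg_scores.max(dim=1).values.view(-1, 1)
        hard_neg_img = neg_scores.max(dim=0).values.view(-1, 1)

        # Hinge terms: push each positive above its hardest negative by a margin.
        loss_i2t = F.relu(self.margin + hard_neg_txt - pos)
        loss_t2i = F.relu(self.margin + hard_neg_img - pos)
        return (loss_i2t + loss_t2i).mean()
```

Mining only the hardest in-batch negative, rather than summing over all negatives, concentrates the gradient on the most confusable pairs; for the exact formulation used by MATE, consult the paper and the linked repository.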
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems