{"title":"Unified learning for image–text alignment via multi-scale feature fusion","authors":"Jing Zhou , Meng Wang","doi":"10.1016/j.cviu.2025.104468","DOIUrl":null,"url":null,"abstract":"<div><div>Cross-modal retrieval, particularly image–text retrieval, aims to achieve efficient matching and retrieval between images and text. With the continuous advancement of deep learning technologies, numerous innovative models and algorithms have emerged. However, existing methods still face some limitations: (1) Most models overly focus on either global or local correspondences, failing to fully integrate global and local information; (2) They typically emphasize cross-modal similarity optimization while neglecting the relationships among samples within the same modality; (3) They struggle to effectively handle noise in image–text pairs, negatively impacting model performance due to noisy negative samples. To address these challenges, this paper proposes a dual-branch structured model that combines global and local matching—Momentum-Augmented Transformer Encoder (MATE). The model aligns closely with human cognitive processes by integrating global and local features and leveraging an External Spatial Attention aggregation (ESA) mechanism and a Multi-modal Fusion Transformer Encoder, significantly enhancing feature representation capabilities. Furthermore, this work introduces a Hard Enhanced Contrastive Triplet Loss (HECT Loss), which effectively optimizes the model’s ability to distinguish positive and negative samples. A self-supervised learning method based on momentum distillation is also employed to further improve image–text matching performance. The experimental results demonstrate that the MATE model outperforms the vast majority of existing state-of-the-art methods on both Flickr30K and MS-COCO datasets. The code is available at <span><span>https://github.com/wangmeng-007/MATE/tree/master</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104468"},"PeriodicalIF":3.5000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001912","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Cross-modal retrieval, particularly image–text retrieval, aims to achieve efficient matching and retrieval between images and text. With the continuous advancement of deep learning, numerous innovative models and algorithms have emerged. However, existing methods still face several limitations: (1) most models focus predominantly on either global or local correspondences and fail to fully integrate global and local information; (2) they typically emphasize cross-modal similarity optimization while neglecting the relationships among samples within the same modality; and (3) they struggle to handle noise in image–text pairs, so noisy negative samples degrade model performance. To address these challenges, this paper proposes the Momentum-Augmented Transformer Encoder (MATE), a dual-branch model that combines global and local matching. The model aligns closely with human cognitive processes by integrating global and local features and by leveraging an External Spatial Attention aggregation (ESA) mechanism and a Multi-modal Fusion Transformer Encoder, which significantly enhance its feature representation capabilities. Furthermore, this work introduces a Hard Enhanced Contrastive Triplet Loss (HECT Loss), which improves the model's ability to distinguish positive from negative samples. A self-supervised learning method based on momentum distillation is also employed to further improve image–text matching performance. Experimental results demonstrate that MATE outperforms the vast majority of existing state-of-the-art methods on both the Flickr30K and MS-COCO datasets. The code is available at https://github.com/wangmeng-007/MATE/tree/master.
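The abstract does not spell out the HECT Loss formulation; as a rough, hypothetical sketch, the idea of a hard-negative contrastive triplet objective over a batch of image and text embeddings can be illustrated with a max-violation hinge loss in the style of VSE++. The class name, margin value, and negative-selection strategy below are assumptions for illustration only and may differ from the loss actually used in MATE.

```python
# Hypothetical sketch of a hard-negative triplet loss over a batch similarity matrix.
# The actual HECT Loss in the paper may weight or select negatives differently.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardNegativeTripletLoss(nn.Module):
    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity matrix between all images and texts in the batch.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        scores = img_emb @ txt_emb.t()              # (B, B)
        pos = scores.diag().view(-1, 1)             # matched pairs lie on the diagonal

        # Mask out the positives so they cannot be selected as negatives.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        neg_scores = scores.masked_fill(mask, -float("inf"))

        # Hardest negative text for each image, hardest negative image for each text.
        hard_neg_txt = neg_scores.max(dim=1).values.view(-1, 1)
        hard_neg_img = neg_scores.max(dim=0).values.view(-1, 1)

        # Hinge terms: push each positive above its hardest negative by a margin.
        loss_i2t = F.relu(self.margin + hard_neg_txt - pos)
        loss_t2i = F.relu(self.margin + hard_neg_img - pos)
        return (loss_i2t + loss_t2i).mean()
```

Mining only the hardest in-batch negative, rather than summing over all negatives, concentrates the gradient on the most confusable pairs; for the exact formulation used by MATE, consult the paper and the linked repository.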
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems