CSA: Cross-scale alignment with adaptive semantic aggregation and filter for image–text retrieval

IF 7.5 · Zone 1 (Computer Science) · JCR Q1: Computer Science, Artificial Intelligence

Pattern Recognition Pub Date : 2025-04-05 DOI:10.1016/j.patcog.2025.111647

Zheng Liu, Junhao Xu, Shanshan Gao, Zhumin Chen

Pattern Recognition, Volume 165, Article 111647. Journal article. https://www.sciencedirect.com/science/article/pii/S0031320325003073

Citations: 0

Abstract

Image–text retrieval (ITR) is challenging because feature representations are inconsistent across modalities, a problem known as the “heterogeneous gap”. Establishing semantic associations between the visual parts of images and the textual parts of texts has proven to be an effective way to bridge this gap. However, existing ITR methods align visual and textual parts only at fixed scales, an approach termed fixed-scale alignment (FSA), and therefore capture only fixed-scale semantic associations. To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. To improve retrieval performance by removing scale constraints from alignment, we propose a novel cross-scale alignment (CSA) framework that strengthens the connections between images and texts by thoroughly exploring cross-scale semantic associations. First, to construct scale-adaptable semantic units, we develop an adaptive semantic aggregation algorithm that generates both position-aware and co-occurrence-aware subsequences and then adaptively merges them according to their IoU (intersection-over-union) values. Second, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, we present an adaptive semantic filter algorithm that learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Third, to learn accurate image–text similarity, we propose a semantic unit alignment strategy that freely aligns visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code is available at https://github.com/xjh0805/CSA.
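The IoU-based merging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes, purely for illustration, that each subsequence is a set of token or region indices, that merging is greedy, and that a fixed threshold of 0.5 is used. The function names `iou` and `merge_by_iou` and the sample inputs are hypothetical.

```python
from typing import FrozenSet, Iterable, List


def iou(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Intersection over union of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_by_iou(
    subsequences: Iterable[Iterable[int]], threshold: float = 0.5
) -> List[List[int]]:
    """Greedily merge index sets whose IoU reaches `threshold`.

    Hypothetical merging rule: the abstract only states that subsequences
    are merged "according to IoU values".
    """
    units: List[FrozenSet[int]] = []
    for seq in subsequences:
        current = frozenset(seq)
        merged = True
        # Keep absorbing existing units until none overlaps enough.
        while merged:
            merged = False
            for i, unit in enumerate(units):
                if iou(current, unit) >= threshold:
                    current = current | units.pop(i)
                    merged = True
                    break
        units.append(current)
    return [sorted(u) for u in units]


# Made-up index subsequences standing in for the two subsequence types.
position_aware = [[0, 1, 2], [2, 3, 4], [5, 6]]
co_occurrence_aware = [[0, 1, 2, 3], [5, 6, 7]]

semantic_units = merge_by_iou(position_aware + co_occurrence_aware, threshold=0.5)
print(semantic_units)  # [[2, 3, 4], [0, 1, 2, 3], [5, 6, 7]]
```

In this toy run the resulting units have different lengths (three and four indices), which is the sense in which merged units are not tied to one fixed scale.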

Source journal

Pattern Recognition (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.