CSA: Cross-scale alignment with adaptive semantic aggregation and filter for image–text retrieval

IF 7.5 | Zone 1 (Computer Science) | JCR Q1: Computer Science, Artificial Intelligence
Zheng Liu, Junhao Xu, Shanshan Gao, Zhumin Chen
Journal: Pattern Recognition, Volume 165, Article 111647
Published: 2025-04-05
DOI: 10.1016/j.patcog.2025.111647
URL: https://www.sciencedirect.com/science/article/pii/S0031320325003073
Citations: 0

Abstract

Image–text retrieval (ITR) is challenging because feature representations are inconsistent across modalities, a problem known as the “heterogeneous gap”. Establishing semantic associations between the visual parts of images and the textual parts of texts has proven to be an effective way to bridge this gap. However, existing ITR methods align visual and textual parts only at fixed scales, an approach referred to as fixed-scale alignment (FSA), and therefore establish only fixed-scale semantic associations. To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. To improve retrieval performance by aligning without scale constraints, we propose a novel cross-scale alignment (CSA) framework that strengthens the connections between images and texts by thoroughly exploring cross-scale semantic associations. First, to construct scale-adaptable semantic units, we develop an adaptive semantic aggregation algorithm that generates both position-aware and co-occurrence-aware subsequences and then adaptively merges them according to their intersection-over-union (IoU) values. Second, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, we present an adaptive semantic filter algorithm that learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Third, to learn accurate image–text similarity, we propose a semantic unit alignment strategy that freely aligns visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code is available at https://github.com/xjh0805/CSA.
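The abstract does not spell out how subsequences are merged, so the following is only a minimal sketch of the general idea of IoU-based merging, not the authors' algorithm. It assumes each candidate subsequence is a set of token (or region) indices and that two candidates are merged when their IoU reaches a fixed threshold. The names `iou` and `merge_by_iou`, the greedy strategy, the fixed `threshold`, and the example candidates are all hypothetical; the paper describes the merge as adaptive.

```python
from typing import List, Set


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two index sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_by_iou(candidates: List[Set[int]], threshold: float = 0.5) -> List[Set[int]]:
    """Greedily fold each candidate into the first existing unit it overlaps enough.

    Assumption: a fixed threshold stands in for the adaptive merge rule
    mentioned in the abstract, whose details are not given there.
    """
    units: List[Set[int]] = []
    for cand in candidates:
        for i, unit in enumerate(units):
            if iou(cand, unit) >= threshold:
                units[i] = unit | cand  # merge into the existing unit
                break
        else:
            units.append(set(cand))  # no sufficient overlap: start a new unit
    return units


# Hypothetical candidates over a six-token caption (indices 0..5).
position_aware = [{0, 1}, {2, 3, 4}]
cooccurrence_aware = [{0, 1, 2}, {5}]
units = merge_by_iou(position_aware + cooccurrence_aware, threshold=0.5)
print(units)
```

In this toy run, `{0, 1}` and `{0, 1, 2}` have an IoU of 2/3 and are merged, while the other candidates stay separate, so the resulting units have different sizes. That variation in unit size is what "scale-adaptable" is taken to mean in this sketch.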
Source journal
Pattern Recognition (Engineering and Technology; Engineering: Electrical and Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.