{"title":"CSA: Cross-scale alignment with adaptive semantic aggregation and filter for image–text retrieval","authors":"Zheng Liu , Junhao Xu , Shanshan Gao , Zhumin Chen","doi":"10.1016/j.patcog.2025.111647","DOIUrl":null,"url":null,"abstract":"<div><div>Due to the inconsistency in feature representations between different modalities, known as the “Heterogeneous gap”, image–text retrieval (ITR) is a challenging task. To bridge this gap, establishing semantic associations between visual and textual parts of images and texts has been proven to be an effective strategy for the ITR task. However, existing ITR methods focus on establishing fixed-scale semantic associations by aligning visual and textual parts at fixed scales, namely, fixed-scale alignment (FSA). To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. Therefore, to achieve the objective of improving the performance of current image–text retrieval systems by introducing cross-scale alignment without scale constraints, we propose a novel cross-scale alignment (CSA) framework to strengthen connections between images and texts via thoroughly exploring cross-scale semantic associations. Firstly, to construct scale-adaptable semantic units, an adaptive semantic aggregation algorithm is developed, which generates both position-aware and co-occurrence-aware subsequences, and then adaptively merges them according to IoU values. Secondly, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, an adaptive semantic filter algorithm is presented, which learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Thirdly, to learn accurate image–text similarity, a semantic unit alignment strategy is proposed to freely align visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code available at: <span><span>https://github.com/xjh0805/CSA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"165 ","pages":"Article 111647"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325003073","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Due to the inconsistency in feature representations between different modalities, known as the “Heterogeneous gap”, image–text retrieval (ITR) is a challenging task. To bridge this gap, establishing semantic associations between visual and textual parts of images and texts has been proven to be an effective strategy for the ITR task. However, existing ITR methods focus on establishing fixed-scale semantic associations by aligning visual and textual parts at fixed scales, namely, fixed-scale alignment (FSA). To overcome the limitations of FSA, cross-scale semantic associations, which exist between visual and textual parts at unfixed scales, should be sufficiently captured. Therefore, to achieve the objective of improving the performance of current image–text retrieval systems by introducing cross-scale alignment without scale constraints, we propose a novel cross-scale alignment (CSA) framework to strengthen connections between images and texts via thoroughly exploring cross-scale semantic associations. Firstly, to construct scale-adaptable semantic units, an adaptive semantic aggregation algorithm is developed, which generates both position-aware and co-occurrence-aware subsequences, and then adaptively merges them according to IoU values. Secondly, to filter out weak semantic associations in both the scale-balanced and scale-unbalanced alignment tasks, an adaptive semantic filter algorithm is presented, which learns two types of mask matrices by adaptively determining boundaries in probability density distributions. Thirdly, to learn accurate image–text similarity, a semantic unit alignment strategy is proposed to freely align visual and textual semantic units across various unfixed scales. Extensive experiments demonstrate the superiority of CSA over state-of-the-art ITR methods. Code available at: https://github.com/xjh0805/CSA.
期刊介绍:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.