From external to internal: Step-wise feature enhancement network for image-text retrieval

IF 6.3 | Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Networks | Pub Date: 2025-09-05 | DOI: 10.1016/j.neunet.2025.108072

Jingyao Wang, Zheng Liu, Shanshan Gao, Junhao Xu, Changhao Li

Neural Networks, Volume 193, Article 108072. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0893608025009529

Citations: 0

Abstract


From external to internal: Step-wise feature enhancement network for image-text retrieval

Image-Text Retrieval (ITR) is a challenging task due to the inherent inconsistency in feature representations across different modalities, commonly referred to as the “heterogeneity gap”. An effective way to bridge this gap is to establish stronger associations between images and texts by capturing semantic cues as comprehensively as possible. However, existing ITR methods cannot fully capture the semantic cues that lie beyond a single image-text pair and must be derived from a large-scale image-text corpus. Therefore, we propose a two-layer Step-wise Feature Enhancement (SFE) Network that establishes a semantic propagation pathway, guiding semantic information to flow progressively from the external layer to the internal layer. In Step 1, External Semantic Cues (ESC) are captured from visual and textual semantic concepts based on patch-level, instance-level, and neighbor-level co-occurrences within an image-text corpus. Visual and textual features are then enhanced in the external layer with ESC by mining co-occurrences at the patch, instance, and neighbor levels. Note that instance-level and neighbor-level co-occurrences belong to cross-modal ESC, which can significantly facilitate modality interaction in the external layer. In Step 2, SFE first fuses the semantic information propagated from Step 1, and then enhances visual and textual features in the internal layer by mining Internal Semantic Cues (ISC) through cross-modal context. Specifically, visual and textual features are concatenated with their corresponding cross-modal contextual features to further enhance modality interaction within the internal layer. Experimental results demonstrate the superiority of the proposed SFE network over state-of-the-art ITR methods.
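The concatenation described for Step 2 can be illustrated with a minimal sketch. The abstract does not say how the cross-modal contextual features are computed, so this example assumes plain scaled dot-product attention as a stand-in; the function names (`cross_modal_context`, `enhance`) and the toy feature values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_context(queries, keys):
    """For each query vector, return an attention-weighted average of `keys`.

    Assumption: scaled dot-product attention stands in for whatever
    context mechanism the paper actually uses.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    contexts = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        contexts.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return contexts


def enhance(features, other_modality):
    """Concatenate each feature with its cross-modal contextual feature."""
    contexts = cross_modal_context(features, other_modality)
    return [f + c for f, c in zip(features, contexts)]


# Toy inputs (hypothetical): 3 image-region features and 2 word features, dim 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
words = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

enhanced_regions = enhance(regions, words)  # 3 vectors of length 8
enhanced_words = enhance(words, regions)    # 2 vectors of length 8
```

Each enhanced vector keeps the original feature in its first half and carries the context gathered from the other modality in its second half, so the dimensionality doubles. A real model would presumably follow this with a learned projection, which is omitted here.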


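Source journal

Neural Networks (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.90

Self-citation rate: 7.70%

Articles published: 425

Review time: 67 days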

About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.