Jingyao Wang, Zheng Liu, Shanshan Gao, Junhao Xu, Changhao Li
Neural Networks, Volume 193, Article 108072. Published 2025-09-05. DOI: 10.1016/j.neunet.2025.108072
https://www.sciencedirect.com/science/article/pii/S0893608025009529
Impact factor: 6.3 (JCR Q1, Computer Science, Artificial Intelligence)
From external to internal: Step-wise feature enhancement network for image-text retrieval
Image-Text Retrieval (ITR) is a challenging task due to the inherent inconsistency in feature representations across different modalities, commonly referred to as the “heterogeneity gap”. An effective way to bridge this gap is to establish stronger associations between images and texts by capturing semantic cues as comprehensively as possible. However, existing ITR methods cannot fully capture semantic cues that lie beyond a single image-text pair and derive from a large-scale image-text corpus. Therefore, we propose a two-layer Step-wise Feature Enhancement (SFE) Network to establish a semantic propagation pathway, guiding semantic information flow progressively from the external layer to the internal layer. In Step 1, External Semantic Cues (ESC) are captured from visual and textual semantic concepts based on patch-level, instance-level, and neighbor-level co-occurrences within an image-text corpus, and visual and textual features are then enhanced in the external layer with these ESC. Note that instance-level and neighbor-level co-occurrences belong to cross-modal ESC, which can significantly facilitate modality interaction in the external layer. In Step 2, SFE first fuses the semantic information propagated from Step 1, and then enhances visual and textual features in the internal layer by mining Internal Semantic Cues (ISC) through cross-modal context. Specifically, visual and textual features are concatenated with their corresponding cross-modal contextual features to further enhance modality interaction within the internal layer. Experimental results demonstrate the superiority of the proposed SFE network over state-of-the-art ITR methods.
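To make the two steps more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that corpus-level co-occurrence can be approximated by counting concept pairs that share an image-text pair, and that "cross-modal context" can be approximated by plain dot-product attention over the other modality. All function names, the input format, and the toy vectors are hypothetical; the abstract does not specify how SFE computes either quantity.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(corpus):
    """Count how often two concepts appear in the same image-text pair.

    `corpus` is a list of concept sets, one per pair (assumed input format).
    """
    counts = Counter()
    for concepts in corpus:
        # Sort so each unordered pair maps to a single (a, b) key.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross_modal_context(queries, keys):
    """For each query vector, return an attention-weighted sum of `keys`."""
    dim = len(keys[0])
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        contexts.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return contexts


def enhance(features, other_modality):
    """Concatenate each feature with its cross-modal context vector."""
    contexts = cross_modal_context(features, other_modality)
    return [f + c for f, c in zip(features, contexts)]


# Toy data (assumed): three image-text pairs described by concept sets.
corpus = [{"dog", "grass"}, {"dog", "ball"}, {"dog", "grass", "ball"}]
counts = cooccurrence_counts(corpus)

# Toy 2-d region and word features (assumed).
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enhanced_regions = enhance(regions, words)  # each vector is now 4-d
enhanced_words = enhance(words, regions)
```

Concatenation keeps the original feature intact next to its context, so a later layer can weigh the two parts separately. A real model would typically use learned projections rather than the raw dot products shown here.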
About the journal:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.