Refining visual token sequence for efficient image captioning

Authors: Tiantao Xian, Zhiheng Zhou, Wenlve Zhou, Zhipeng Zhang
DOI: 10.1016/j.neunet.2025.107759
Journal: Neural Networks, Volume 191, Article 107759
Publication date: 2025-06-26 (Journal Article)
Impact Factor: 6.3; JCR: Q1, Computer Science, Artificial Intelligence; CAS Region 1 (Computer Science)
URL: https://www.sciencedirect.com/science/article/pii/S0893608025006392

Abstract: In practical applications, both accuracy and speed are critical for image captioning (IC) systems. Recently, transformer-based architectures have significantly advanced the field of IC; however, these improvements often come at the cost of increased computational complexity and slower inference speeds. In this paper, we conduct a comprehensive analysis of the computational overhead of IC models and find that the visual encoding process accounts for the majority of this overhead. Considering the redundancy in visual information, where many regions are irrelevant or contribute little information to prediction, we propose a knowledge-injection-based visual token Reduction module. This module estimates the importance of each token and retains only a subset of them. To minimize visual semantic loss, we introduce token Fusion and Insertion modules that supplement visual semantics by reusing discarded tokens and capturing global semantics. Based on this, our visual token sequence refinement strategy, referred to as RFI, is deployed at specific positions in the visual backbone to hierarchically compress the visual token sequence, thereby reducing the overall computational overhead of the model at its source. Extensive experiments demonstrate the effectiveness of the proposed method, showing that it can accelerate model inference without sacrificing performance. Additionally, the method allows for flexible trade-offs between accuracy and speed under different settings.
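The abstract outlines a three-step refinement: score each visual token's importance, keep only the top-scoring subset, then recover some of the discarded semantics by fusing the dropped tokens into a summary token that is inserted back into the sequence. The sketch below illustrates that general idea in NumPy; it is a hypothetical reconstruction from the abstract alone, not the authors' implementation (the function name `rfi_refine`, the softmax weighting, and the single fused token are all assumptions).

```python
import numpy as np

def rfi_refine(tokens, scores, keep_ratio=0.5):
    """Hypothetical sketch of an RFI-style token refinement step.

    tokens: (N, D) visual token sequence
    scores: (N,) importance estimates (e.g. from a learned predictor)
    Returns a shorter sequence: the top-k kept tokens plus one fused
    token summarizing the discarded ones.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]          # most important first
    keep_idx, drop_idx = order[:k], order[k:]

    kept = tokens[keep_idx]
    if drop_idx.size == 0:
        return kept

    # Fusion: compress discarded tokens into one score-weighted summary
    # so their semantics are not lost entirely (softmax over scores).
    w = np.exp(scores[drop_idx] - scores[drop_idx].max())
    w = w / w.sum()
    fused = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)

    # Insertion: append the fused summary token to the kept sequence.
    return np.concatenate([kept, fused], axis=0)
```

Applied hierarchically at several backbone stages, a step like this shrinks the token sequence early, so every later attention layer operates on fewer tokens; `keep_ratio` would give the accuracy/speed trade-off the abstract mentions.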
Journal description:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.