Refining visual token sequence for efficient image captioning

Authors: Tiantao Xian, Zhiheng Zhou, Wenlve Zhou, Zhipeng Zhang
DOI: 10.1016/j.neunet.2025.107759
Journal: Neural Networks, Volume 191, Article 107759
Publication date: 2025-06-26 (Journal Article)
Impact Factor: 6.3; JCR: Q1, Computer Science, Artificial Intelligence; CAS Region 1 (Computer Science)
URL: https://www.sciencedirect.com/science/article/pii/S0893608025006392

Abstract: In practical applications, both accuracy and speed are critical for image captioning (IC) systems. Recently, transformer-based architectures have significantly advanced the field of IC; however, these improvements often come at the cost of increased computational complexity and slower inference speeds. In this paper, we conduct a comprehensive analysis of the computational overhead of IC models and find that the visual encoding process accounts for the majority of this overhead. Considering the redundancy in visual information, where many regions are irrelevant or contribute little information to prediction, we propose a knowledge-injection-based visual token Reduction module. This module estimates the importance of each token and retains only a subset of them. To minimize visual semantic loss, we introduce token Fusion and Insertion modules that supplement visual semantics by reusing discarded tokens and capturing global semantics. Based on this, our visual token sequence refinement strategy, referred to as RFI, is deployed at specific positions in the visual backbone to hierarchically compress the visual token sequence, thereby reducing the overall computational overhead of the model at its source. Extensive experiments demonstrate the effectiveness of the proposed method, showing that it can accelerate model inference without sacrificing performance. Additionally, the method allows for flexible trade-offs between accuracy and speed under different settings.
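The abstract outlines a three-step refinement: score each visual token's importance, keep only the top-scoring subset, then recover some of the discarded semantics by fusing the dropped tokens into a summary token that is inserted back into the sequence. The sketch below illustrates that general idea in NumPy; it is a hypothetical reconstruction from the abstract alone, not the authors' implementation (the function name `rfi_refine`, the softmax weighting, and the single fused token are all assumptions).

```python
import numpy as np

def rfi_refine(tokens, scores, keep_ratio=0.5):
    """Hypothetical sketch of an RFI-style token refinement step.

    tokens: (N, D) visual token sequence
    scores: (N,) importance estimates (e.g. from a learned predictor)
    Returns a shorter sequence: the top-k kept tokens plus one fused
    token summarizing the discarded ones.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]          # most important first
    keep_idx, drop_idx = order[:k], order[k:]

    kept = tokens[keep_idx]
    if drop_idx.size == 0:
        return kept

    # Fusion: compress discarded tokens into one score-weighted summary
    # so their semantics are not lost entirely (softmax over scores).
    w = np.exp(scores[drop_idx] - scores[drop_idx].max())
    w = w / w.sum()
    fused = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)

    # Insertion: append the fused summary token to the kept sequence.
    return np.concatenate([kept, fused], axis=0)
```

Applied hierarchically at several backbone stages, a step like this shrinks the token sequence early, so every later attention layer operates on fewer tokens; `keep_ratio` would give the accuracy/speed trade-off the abstract mentions.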
Journal description:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.