Refining visual token sequence for efficient image captioning

IF 6.3 | CAS Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Tiantao Xian, Zhiheng Zhou, Wenlve Zhou, Zhipeng Zhang
{"title":"改进视觉标记序列,以实现高效的图像字幕","authors":"Tiantao Xian ,&nbsp;Zhiheng Zhou ,&nbsp;Wenlve Zhou ,&nbsp;Zhipeng Zhang","doi":"10.1016/j.neunet.2025.107759","DOIUrl":null,"url":null,"abstract":"<div><div>In practical applications, both accuracy and speed are critical for image captioning (IC) systems. Recently, transformer-based architectures have significantly advanced the field of IC; however, these improvements often come at the cost of increased computational complexity and slower inference speeds. In this paper, we conduct a comprehensive analysis of the computational overhead of IC models and find that the visual encoding process accounts for the majority of this overhead. Considering the redundancy in visual information — where many regions are irrelevant or provide low information for prediction — we propose a knowledge-injection-based visual token <strong>R</strong>eduction module. This module estimates the importance of each token and retains only a subset of them. To minimize visual semantic loss, we introduce token <strong>F</strong>usion and <strong>I</strong>nsertion modules that supplement visual semantics by reusing discarded tokens and capturing global semantics. Based on this, our visual token sequence refinement strategy, referred to as RFI, is deployed at specific positions in the visual backbone to hierarchically compress the visual token sequence, thereby reducing the overall computational overhead of the model at its source. Extensive experiments demonstrate the effectiveness of the proposed method, showing that it can accelerate model inference without sacrificing performance. Additionally, the method allows for flexible trade-offs between accuracy and speed under different settings.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"191 ","pages":"Article 107759"},"PeriodicalIF":6.3000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Refining visual token sequence for efficient image captioning\",\"authors\":\"Tiantao Xian ,&nbsp;Zhiheng Zhou ,&nbsp;Wenlve Zhou ,&nbsp;Zhipeng Zhang\",\"doi\":\"10.1016/j.neunet.2025.107759\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In practical applications, both accuracy and speed are critical for image captioning (IC) systems. Recently, transformer-based architectures have significantly advanced the field of IC; however, these improvements often come at the cost of increased computational complexity and slower inference speeds. In this paper, we conduct a comprehensive analysis of the computational overhead of IC models and find that the visual encoding process accounts for the majority of this overhead. Considering the redundancy in visual information — where many regions are irrelevant or provide low information for prediction — we propose a knowledge-injection-based visual token <strong>R</strong>eduction module. This module estimates the importance of each token and retains only a subset of them. To minimize visual semantic loss, we introduce token <strong>F</strong>usion and <strong>I</strong>nsertion modules that supplement visual semantics by reusing discarded tokens and capturing global semantics. 
Based on this, our visual token sequence refinement strategy, referred to as RFI, is deployed at specific positions in the visual backbone to hierarchically compress the visual token sequence, thereby reducing the overall computational overhead of the model at its source. Extensive experiments demonstrate the effectiveness of the proposed method, showing that it can accelerate model inference without sacrificing performance. Additionally, the method allows for flexible trade-offs between accuracy and speed under different settings.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"191 \",\"pages\":\"Article 107759\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608025006392\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025006392","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In practical applications, both accuracy and speed are critical for image captioning (IC) systems. Recently, transformer-based architectures have significantly advanced the field of IC; however, these improvements often come at the cost of increased computational complexity and slower inference speeds. In this paper, we conduct a comprehensive analysis of the computational overhead of IC models and find that the visual encoding process accounts for the majority of this overhead. Considering the redundancy in visual information — where many regions are irrelevant or carry little information for prediction — we propose a knowledge-injection-based visual token Reduction module. This module estimates the importance of each token and retains only a subset of them. To minimize visual semantic loss, we introduce token Fusion and Insertion modules that supplement visual semantics by reusing discarded tokens and capturing global semantics. Building on these, our visual token sequence refinement strategy, referred to as RFI, is deployed at specific positions in the visual backbone to hierarchically compress the visual token sequence, thereby reducing the overall computational overhead of the model at its source. Extensive experiments demonstrate the effectiveness of the proposed method, showing that it can accelerate model inference without sacrificing performance. Additionally, the method allows for flexible trade-offs between accuracy and speed under different settings.
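The reduce-fuse-insert pipeline the abstract describes maps naturally onto a small module. The PyTorch sketch below is illustrative only: the linear scoring head, the keep ratio, the softmax-weighted fusion of discarded tokens, and the mean-pooled global token are our assumptions for exposition, not the authors' published RFI implementation (in particular, the paper injects knowledge into the importance estimate, which a plain linear score does not capture).

```python
# Minimal sketch of a Reduction-Fusion-Insertion (RFI) style block.
# All names and design choices here are hypothetical, not the paper's code.
import torch
import torch.nn as nn


class RFIBlock(nn.Module):
    """Reduce a visual token sequence by importance, fuse the discarded
    tokens into one summary token, and insert a global token to limit
    visual semantic loss."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Hypothetical importance head; stands in for the paper's
        # knowledge-injection-based importance estimate.
        self.score = nn.Linear(dim, 1)
        self.global_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))

        # Reduction: keep the top-k tokens by estimated importance.
        w = self.score(x).squeeze(-1)                        # (b, n)
        keep_idx = w.topk(k, dim=1).indices                  # (b, k)
        drop_idx = w.topk(n - k, dim=1, largest=False).indices
        kept = x.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        dropped = x.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, d))

        # Fusion: reuse discarded tokens as one importance-weighted token.
        drop_w = torch.softmax(w.gather(1, drop_idx), dim=1).unsqueeze(-1)
        fused = (drop_w * dropped).sum(dim=1, keepdim=True)  # (b, 1, d)

        # Insertion: append a global-semantics token (mean-pooled here).
        global_tok = self.global_proj(x.mean(dim=1, keepdim=True))

        return torch.cat([kept, fused, global_tok], dim=1)
```

Deploying such a block at a few depths of the visual backbone shortens the token sequence hierarchically, so every downstream attention layer operates on a quadratically cheaper input; this is one way to read the abstract's claim of reducing the model's overhead "at its source", and tuning the keep ratio gives the accuracy-speed trade-off it mentions.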
Source journal: Neural Networks (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Annual articles: 425
Review time: 67 days
Journal introduction: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.