End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration

Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao
{"title":"基于视觉区域聚合和双层协作的端到端图像字幕","authors":"Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao","doi":"10.21655/ijsi.1673-7288.00316","DOIUrl":null,"url":null,"abstract":"PDF HTML XML Export Cite reminder End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration DOI: 10.21655/ijsi.1673-7288.00316 Author: Affiliation: Clc Number: Fund Project: Article | Figures | Metrics | Reference | Related | Cited by | Materials | Comments Abstract:To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of the object content by the model. Thus, the applicability of using pre-trained models for image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, can significantly improve the quality of image captioning, and achieves state-of-the-art performance. Reference Related Cited by","PeriodicalId":479632,"journal":{"name":"International Journal of Software and Informatics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration\",\"authors\":\"Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao\",\"doi\":\"10.21655/ijsi.1673-7288.00316\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"PDF HTML XML Export Cite reminder End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration DOI: 10.21655/ijsi.1673-7288.00316 Author: Affiliation: Clc Number: Fund Project: Article | Figures | Metrics | Reference | Related | Cited by | Materials | Comments Abstract:To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of the object content by the model. Thus, the applicability of using pre-trained models for image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). 
Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, can significantly improve the quality of image captioning, and achieves state-of-the-art performance. Reference Related Cited by\",\"PeriodicalId\":479632,\"journal\":{\"name\":\"International Journal of Software and Informatics\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Software and Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21655/ijsi.1673-7288.00316\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Software and Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21655/ijsi.1673-7288.00316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
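The abstract does not spell out how grid features with similar semantics are grouped. The sketch below is one plausible reading, assuming a learnable soft-assignment of grid features to a fixed number of pseudo-region prototypes; the module name, `num_regions`, and the overall design are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the visual-region-aggregation idea, assuming a learnable
# soft-assignment over grid features (not the paper's verified implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualRegionAggregation(nn.Module):
    """Groups N grid features with similar semantics into K compact region features."""

    def __init__(self, dim: int = 768, num_regions: int = 16):
        super().__init__()
        # Score each grid feature against K learnable "region" prototypes.
        self.assign = nn.Linear(dim, num_regions)
        self.norm = nn.LayerNorm(dim)

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, N, dim) grid features from the pre-trained backbone
        logits = self.assign(grid_feats)            # (batch, N, K)
        weights = F.softmax(logits, dim=1)          # each region is a distribution over grids
        # Weighted average of grid features per region -> compact region representation
        regions = torch.einsum("bnk,bnd->bkd", weights, grid_feats)
        return self.norm(regions)                   # (batch, K, dim)


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)                    # e.g. a 14x14 grid of features
    regions = VisualRegionAggregation()(x)
    print(regions.shape)                            # torch.Size([2, 16, 768])
```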
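Likewise, "dual-level collaboration" is described only as cross-attention over the two visual feature sets. A minimal sketch of that reading follows, assuming the decoder states cross-attend to grid-level and region-level features separately and fuse the results with a learned gate; the gating fusion and all identifiers are assumptions for illustration.

```python
# A minimal sketch of dual-level collaboration: decoder states cross-attend to
# both grid-level and region-level features, then fuse the two contexts.
# The gate-based fusion is an assumption, not the paper's confirmed design.
import torch
import torch.nn as nn


class DualLevelCollaboration(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.grid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate deciding how much to trust each visual level per token.
        self.gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, grid_feats, region_feats):
        # queries: (batch, T, dim) decoder states; grid_feats: (batch, N, dim);
        # region_feats: (batch, K, dim) from the aggregation module above.
        g, _ = self.grid_attn(queries, grid_feats, grid_feats)        # grid-level context
        r, _ = self.region_attn(queries, region_feats, region_feats)  # region-level context
        gate = torch.sigmoid(self.gate(torch.cat([g, r], dim=-1)))    # (batch, T, dim)
        fused = gate * g + (1.0 - gate) * r                           # element-wise fusion
        return self.norm(queries + fused)                             # residual connection


if __name__ == "__main__":
    q = torch.randn(2, 20, 768)                     # 20 decoding steps
    grids = torch.randn(2, 196, 768)
    regions = torch.randn(2, 16, 768)
    out = DualLevelCollaboration()(q, grids, regions)
    print(out.shape)                                # torch.Size([2, 20, 768])
```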