{"title":"End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration","authors":"Jingkuan Song, Pengpeng Zeng, Jiayang Gu, Jinkuan Zhu, Lianli Gao","doi":"10.21655/ijsi.1673-7288.00316","DOIUrl":null,"url":null,"abstract":"PDF HTML XML Export Cite reminder End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration DOI: 10.21655/ijsi.1673-7288.00316 Author: Affiliation: Clc Number: Fund Project: Article | Figures | Metrics | Reference | Related | Cited by | Materials | Comments Abstract:To date, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, leading to a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning, and enabling better performance and faster inference. However, the grid features extracted with the pre-trained model lack regional visual information, which leads to inaccurate descriptions of the object content by the model. Thus, the applicability of using pre-trained models for image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, this paper designs a visual region aggregation that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, dual-level collaboration uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which in turn generates more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, can significantly improve the quality of image captioning, and achieves state-of-the-art performance. Reference Related Cited by","PeriodicalId":479632,"journal":{"name":"International Journal of Software and Informatics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Software and Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21655/ijsi.1673-7288.00316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
To date, Transformer-based pre-trained models have demonstrated powerful modality-representation capabilities, driving a shift toward a fully end-to-end paradigm for multimodal downstream tasks such as image captioning and enabling better performance and faster inference. However, the grid features extracted by a pre-trained model lack regional visual information, which leads the model to describe object content inaccurately; as a result, the applicability of pre-trained models to image captioning remains largely unexplored. Toward this goal, this paper proposes a novel end-to-end image captioning method based on Visual Region Aggregation and Dual-level Collaboration (VRADC). Specifically, to learn regional visual information, the paper designs a visual region aggregation module that groups grid features with similar semantics into a compact visual region representation. Next, dual-level collaboration uses a cross-attention mechanism to learn more representative semantic information from the two kinds of visual features, which in turn yields more fine-grained descriptions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed method, VRADC, significantly improves the quality of image captions and achieves state-of-the-art performance.
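To make the two components concrete, below is a minimal PyTorch sketch of how visual region aggregation and dual-level collaboration could be wired together. It is an illustration under stated assumptions, not the paper's implementation: the aggregation is approximated here with learned region queries and a soft assignment over grid features, and the dual-level fusion with two cross-attention branches combined by a learned gate; the class names, the number of regions, and the gating scheme are all hypothetical.

```python
import torch
import torch.nn as nn


class VisualRegionAggregation(nn.Module):
    """Aggregate grid features with similar semantics into a small set of
    compact region representations (assumed soft-assignment mechanism)."""

    def __init__(self, dim: int, num_regions: int = 16):
        super().__init__()
        # Hypothetical learned region queries; the paper's exact aggregation may differ.
        self.region_queries = nn.Parameter(torch.randn(num_regions, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        # grid: (B, N, D) flattened grid features from a pre-trained backbone
        q = self.region_queries.unsqueeze(0)                              # (1, R, D)
        scores = q @ self.proj(grid).transpose(1, 2) / grid.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)                              # (B, R, N)
        return attn @ grid                                                # (B, R, D)


class DualLevelCollaboration(nn.Module):
    """Cross-attend decoder states to grid-level and region-level features,
    then fuse the two contexts with a learned gate (again, a sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.grid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, queries, grid, regions):
        # queries: (B, T, D) caption decoder hidden states
        g, _ = self.grid_attn(queries, grid, grid)          # grid-level context
        r, _ = self.region_attn(queries, regions, regions)  # region-level context
        gate = torch.sigmoid(self.gate(torch.cat([g, r], dim=-1)))
        return gate * g + (1 - gate) * r                    # fused visual context


if __name__ == "__main__":
    B, N, T, D = 2, 49, 10, 512
    grid = torch.randn(B, N, D)     # e.g. 7x7 grid features from a pre-trained backbone
    dec = torch.randn(B, T, D)      # dummy caption-decoder states
    regions = VisualRegionAggregation(D)(grid)
    fused = DualLevelCollaboration(D)(dec, grid, regions)
    print(regions.shape, fused.shape)   # torch.Size([2, 16, 512]) torch.Size([2, 10, 512])
```

The fused context would then feed the caption decoder in place of a single cross-attention over grid features alone, which is the mechanism the abstract credits for the finer-grained descriptions.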