Retrieval-Augmented Transformer for Image Captioning

Proceedings of the 19th International Conference on Content-based Multimedia Indexing. Pub Date: 2022-07-26. DOI: 10.1145/3549555.3549585

Sara Sarto, Marcella Cornia, L. Baraldi, R. Cucchiara

Citations: 23

Abstract

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
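The retrieval step described in the abstract (looking up text from an external corpus by visual similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `retrieve_captions`, the use of cosine similarity as the visual similarity measure, the three-dimensional toy features, and the example captions are all assumptions made for illustration. In practice the features would come from a visual encoder and the search would run over a much larger memory.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(
    query: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[str]:
    """Return the captions of the k memory entries most similar to the query.

    `memory` is a hypothetical external corpus: a list of
    (image feature vector, caption) pairs.
    """
    scored = [(cosine_similarity(query, feat), cap) for feat, cap in memory]
    # Highest similarity first; a brute-force scan is enough for a toy memory.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cap for _, cap in scored[:k]]


# Toy external memory with made-up features and captions (illustrative only).
memory = [
    ([1.0, 0.0, 0.0], "a dog running on the grass"),
    ([0.9, 0.1, 0.0], "a puppy playing in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
    ([0.0, 0.0, 1.0], "a plane flying in the sky"),
]

query = [0.95, 0.05, 0.0]  # stand-in for the feature vector of the input image
retrieved = retrieve_captions(query, memory, k=2)
print(retrieved)
```

In a captioning model of the kind the abstract describes, the retrieved text would then be made available to the decoder through the kNN-augmented attention layer, alongside the past context; that part is not shown here.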

