Multiple Perspective Caption Generation with Attention Mechanism

2020 9th International Congress on Advanced Applied Informatics (IIAI-AAI). Pub Date: 2020-09-01. DOI: 10.1109/IIAI-AAI50415.2020.00031

H. Yanagimoto, Maaki Shozu


Citations: 4

Abstract

A caption generation system describes the content of an image in natural language, so it must understand both the image and the text. Caption generation is therefore an essential task spanning natural language processing and image processing. Many researchers have recently turned to deep learning as a key technique for building caption generation systems because it can construct an intermediate representation shared between image processing and natural language processing. First, the system extracts a feature from a given image with a convolutional neural network. Then, the system generates a word sequence, the caption, from that feature. The system thus consists of two modules, an image processing module and a language model module, which are trained jointly on a training dataset. A deep learning based caption generation system is a black box, which makes it difficult for a human to collaborate with it. We therefore introduce an attention mechanism into the caption generation system and control caption generation through the attention weights. Conventional caption generation systems produce only a single caption per image because they generate the caption directly from the image. The proposed system can generate several different captions from the same image because the initial attention weights can be controlled. Our results demonstrate how a human can collaborate with a deep learning system and that this collaboration improves caption generation.
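
The idea of steering a caption through the initial attention weights can be illustrated with a minimal sketch. This is not the authors' implementation: the toy region features stand in for CNN output, and the helper `initial_weights` with its `focus` and `sharpness` parameters is a hypothetical name introduced here only for illustration. The sketch shows that different initial weights over the same image regions give different context vectors, which is what a language model would then condition on.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(region_features, weights):
    """Attention context: the weighted sum of per-region feature vectors."""
    dim = len(region_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]


def initial_weights(num_regions, focus, sharpness=4.0):
    """Hypothetical helper: bias the first attention step toward one region."""
    scores = [sharpness if i == focus else 0.0 for i in range(num_regions)]
    return softmax(scores)


# Toy stand-in for CNN output: 3 image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

if __name__ == "__main__":
    # Same image, three different starting points for the attention.
    for focus in range(len(regions)):
        w = initial_weights(len(regions), focus)
        c = context_vector(regions, w)
        print(focus, [round(x, 3) for x in w], [round(x, 3) for x in c])
```

In a full system the decoder would update the attention weights at every word; here only the first step is shown, since that is the step the abstract says is controlled.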

