Adaptive Attention Generation for Indonesian Image Captioning

Made Raharja Surya Mahadi, A. Arifianto, Kurniawan Nur Ramadhani
DOI: 10.1109/ICoICT49345.2020.9166244
Published in: 2020 8th International Conference on Information and Communication Technology (ICoICT), June 2020
Citations: 8

Abstract

Image captioning is one of the most widely discussed topics today. However, most research in this area generates English captions, while thousands of languages exist around the world. Given each language's uniqueness, dedicated research is needed to generate captions in those languages. Indonesia, the largest Southeast Asian country, has its own language, Bahasa Indonesia, which is taught in countries such as Vietnam, Australia, and Japan. In this research, we propose an attention-based image captioning model using ResNet101 as the encoder and an LSTM with adaptive attention as the decoder for the Indonesian image captioning task. Adaptive attention decides when, and at which region of the image, the model should attend to produce the next word. The model was trained on the MSCOCO and Flickr30k datasets, both translated into Bahasa Indonesia manually by humans and by Google Translate. Our research achieved scores of 0.678, 0.512, 0.375, 0.274, and 0.990 for BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr, respectively. Our model also produces scores similar to those of English image captioning models, which suggests it can be equivalent to English image captioning. We also propose a new metric based on a survey, whose results show that 76.8% of our model's captions are rated better than validation data translated with Google Translate.
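The core of the decoder described above is an adaptive-attention step: at each word, a "visual sentinel" lets the LSTM fall back on its language model instead of the image. A minimal NumPy sketch of that step is shown below, following the standard visual-sentinel formulation of adaptive attention; all variable and weight names are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention_step(V, h, s, W_v, W_g, W_s, w_h):
    """One adaptive-attention step (illustrative names).

    V   : (k, d)  spatial image features from the CNN encoder
    h   : (d,)    current LSTM hidden state
    s   : (d,)    visual sentinel (what the language model already 'knows')
    W_v, W_g, W_s : (d, m) projection matrices; w_h : (m,) scoring vector

    Returns the blended context vector and the sentinel gate beta.
    """
    # Attention scores over the k image regions
    z = np.tanh(V @ W_v + h @ W_g) @ w_h              # (k,)
    # Score for the sentinel, treated as a (k+1)-th attendable slot
    z_s = np.tanh(s @ W_s + h @ W_g) @ w_h            # scalar
    alpha = softmax(np.append(z, z_s))                # (k+1,), sums to 1
    beta = alpha[-1]                                  # gate in [0, 1]
    c = (alpha[:-1][:, None] * V).sum(axis=0)         # visual context (d,)
    # beta -> rely on the language model; (1 - beta) -> rely on the image
    return beta * s + (1 - beta) * c, beta
```

When beta is near 1 the model generates the next word mostly from language context (e.g. function words); when beta is near 0 it attends to image regions, which is the "when and where to attend" behavior the abstract describes.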